# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

> You do not need to run the following cells if you are running this notebook locally. 

In [1]:
#!pip install -qU langchain langchain-openai langchain-cohere rank_bm25

We're also going to use [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") vector database in in-memory mode (so we can run it locally in our Colab environment).

In [2]:
#!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [4]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [5]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2025-03-01 17:50:25--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘john_wick_1.csv’


2025-03-01 17:50:25 (6.11 MB/s) - ‘john_wick_1.csv’ saved [19628/19628]

--2025-03-01 17:50:26--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘john_wick_2.csv’


2025-03-01 17:50:26 (5.37 MB/s) - ‘john_wick_2.csv’

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [7]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 2, 26, 17, 50, 27, 386953)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [9]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [10]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
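
To make "cosine-similarity top-`k`" concrete, here's a minimal plain-Python sketch of what the vector store is doing conceptually. The helper names (`cosine_similarity`, `top_k`) and the 2-d toy vectors are made up for illustration; Qdrant's real implementation is far more optimized.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-d "embeddings" (hypothetical values; the real vectors have 1536 dimensions)
toy_doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query_vec = [0.9, 0.1]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(nearest)
```

The retriever does the same thing at scale: embed the question, score every stored review against it, and keep the best `k`.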

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-3.5-turbo` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [12]:
from langchain_openai import ChatOpenAI

# Name the model explicitly so the code matches the prose above
chat_model = ChatOpenAI(model="gpt-3.5-turbo")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [13]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
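
If the LCEL pipes feel opaque, this plain-Python sketch traces the same data flow with ordinary functions. `fake_retriever` and `fake_llm` are hypothetical stand-ins (no API calls are made); the point is only to show which keys exist at each step.

```python
def fake_retriever(question):
    # Stand-in for naive_retriever: returns a fixed list of "documents"
    return ["review A", "review B"]

def fake_llm(prompt_text):
    # Stand-in for `rag_prompt | chat_model`
    return f"answer based on {prompt_text.count('review')} reviews"

def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: RunnablePassthrough.assign(context=itemgetter("context")) re-assigns
    #         "context" to its own value, so the dict is unchanged here
    # Step 3: format the prompt, call the model, and carry the context through
    prompt_text = f"Query:\n{state['question']}\n\nContext:\n{state['context']}"
    return {"response": fake_llm(prompt_text), "context": state["context"]}

out = run_chain({"question": "Did people generally like John Wick?"})
print(out)
```

Returning `"context"` alongside `"response"` is what lets us inspect (and later evaluate) exactly which documents each retriever supplied.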

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [14]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick.'

In [15]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\'.'

In [16]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In "John Wick," the storyline follows an ex-hit-man who comes out of retirement to seek vengeance against the gangsters who killed his dog and took everything from him. The film is filled with action-packed shootouts, breathtaking fights, and a suspenseful story of revenge and redemption.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [17]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)
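
To make the bag-of-words idea concrete, here's a minimal sketch of Okapi BM25 scoring in plain Python. It is not the code `BM25Retriever` runs (that work is delegated to the `rank_bm25` package we installed earlier); the whitespace tokenizer, the `k1` and `b` values, and the `+ 1` inside the IDF logarithm are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse-document-frequency weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized via `b`
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    "john wick loves his dog",
    "the action scenes are stylish",
    "a dog and a car",
]
toy_scores = bm25_scores("dog action", toy_corpus)
print(toy_scores)
```

In this toy corpus "action" appears in only one document while "dog" appears in two, so the document containing the rarer word scores highest. No embeddings are involved, which is why BM25 is cheap and also why it misses paraphrases.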

We'll construct the same chain - only changing the retriever.

In [18]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [19]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People's opinions on John Wick seem to vary. Some loved the action sequences, the world-building, and Keanu Reeves' performance, while others found the movie lacking in substance, plot, and character development. So, it really depends on individual preferences whether people generally liked John Wick or not."

In [20]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. Here is the URL to that review: https://www.imdb.com/review/rw8946038/?ref_=tt_urv'

In [21]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'John Wick is a movie known for its beautifully choreographed action scenes and emotional setup. It features Keanu Reeves in the lead role. If you love action movies, you will enjoy this film.'

It's not clear that this is better or worse - but the vague, plot-free answer to the last question isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [22]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
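
The retrieve-then-rerank pattern itself is tiny. Here's a sketch under stated assumptions: `toy_overlap_score` is a hypothetical word-overlap scorer standing in for the Cohere Rerank model, and `first_stage` is a made-up candidate list standing in for the 10 documents from `naive_retriever`.

```python
def rerank(query, candidates, score_fn, top_n):
    """Re-score first-stage candidates and keep only the top_n best."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

def toy_overlap_score(query, doc):
    # Hypothetical stand-in for a reranking model:
    # the fraction of query words that also appear in the document
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split())) / len(query_words)

# Pretend these came back from a broad first-stage retriever
first_stage = [
    "great soundtrack and visuals",
    "john wick avenges his dog",
    "the dog dies early in john wick",
    "popcorn was stale",
]

compressed = rerank(
    "what happens to the dog in john wick",
    first_stage,
    toy_overlap_score,
    top_n=2,
)
print(compressed)
```

The first stage optimizes for recall (grab plenty of candidates), and the reranker optimizes for precision (keep only the best few), which is why we retrieved `k=10` earlier.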

Let's create our chain again, and see how this does!

In [23]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [24]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided in the context.'

In [25]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. Here is the URL to that review: "/review/rw4854296/?ref_=tt_urv"'

In [26]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, after resolving issues with the Russian mafia, John Wick refuses to help a mobster named Santino D'Antonio. In retaliation, Santino blows up Wick's house. Wick then meets with Winston, the owner of the Continental hotel in New York City, who tells him he must honor the marker given by Santino. Santino asks Wick to kill his sister in Rome so he can take her place in criminal organizations. After completing the assignment, Santino puts a seven-million dollar contract on Wick, leading to professional killers coming after him. Wick decides to seek revenge on Santino."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how hard is it to set up? Not bad! Let's see it down below!



In [27]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
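
The three steps listed above boil down to "fan out, retrieve, de-duplicate". Here's a minimal sketch in plain Python; `toy_generate_queries` and `toy_retrieve` are hypothetical stand-ins for the LLM call and the base retriever, and documents are represented by simple string IDs.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for each generated query; return unique docs in first-seen order."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

def toy_generate_queries(question):
    # Hypothetical rewrites an LLM might produce for the user's question
    return [
        "audience reaction to John Wick",
        "John Wick review sentiment",
        "is John Wick worth watching",
    ]

toy_index = {
    "audience reaction to John Wick": ["doc1", "doc2"],
    "John Wick review sentiment": ["doc2", "doc3"],
    "is John Wick worth watching": ["doc3", "doc1", "doc4"],
}

def toy_retrieve(query):
    return toy_index.get(query, [])

merged = multi_query_retrieve("Did people like John Wick?", toy_generate_queries, toy_retrieve)
print(merged)
```

Note the cost trade-off: every user question now triggers one extra LLM call plus `n` retrievals, and the merged context can be considerably larger than `k` documents.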

In [28]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [29]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Yes, people generally liked John Wick. The reviews highlight the slickness of Keanu Reeves' performance, the brilliance of the action sequences, and the overall entertainment value of the film."

In [30]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there is one review with a rating of 10. Here is the URL to that review:\n- '/review/rw4854296/?ref_=tt_urv'"

In [31]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In "John Wick," the main character, John Wick, a retired assassin, comes out of retirement when someone kills his dog. This event sets off a chain of events that involve a lot of carnage and revenge. John Wick is forced back into the world of assassins, facing off against various enemies while trying to settle old debts and protect himself.'

## Task 8: Parent Document Retriever

The Parent Document Retriever is a "small-to-big" strategy that works as follows:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
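
Here's a toy sketch of "search small, return big" in plain Python. Everything in it is an illustrative assumption: `split_into_chunks` is a naive fixed-width splitter rather than `RecursiveCharacterTextSplitter`, and a case-insensitive keyword match stands in for the vector similarity search.

```python
def split_into_chunks(text, chunk_size):
    """Naive fixed-width splitter (a stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# The "docstore": full parent documents keyed by ID
parents = {
    "p1": "John Wick is a retired hitman. His dog is killed by gangsters.",
    "p2": "The sequel sends him to Rome. The action is even bigger.",
}

# The "vector store": child chunks that remember which parent they came from
children = [
    (chunk, parent_id)
    for parent_id, text in parents.items()
    for chunk in split_into_chunks(text, chunk_size=30)
]

def retrieve_parents(query_word, children, parents):
    """Match on child chunks, but return the de-duplicated parent documents."""
    parent_ids = []
    for chunk, parent_id in children:
        if query_word.lower() in chunk.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]

hits = retrieve_parents("Rome", children, parents)
print(hits)
```

The match happens on a 30-character chunk, but the caller receives the whole parent document, surrounding context included.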

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [32]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [33]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [34]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [35]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [36]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [37]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People's opinions on John Wick seem to be divided based on the reviews provided. Some people really enjoy the series and find it consistent and well-received, while others have strong negative opinions about it. So, it seems like whether people generally like John Wick or not depends on individual preferences."

In [38]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3." The URL to that review is: \'/review/rw4854296/?ref_=tt_urv\'.'

In [39]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, a retired assassin named John Wick seeks vengeance after his dog is killed and his car is stolen. He gets dragged into a task to pay off an old debt by helping to take over the Assassin's Guild in Italy, Canada, and Manhattan, which leads to him killing many assassins."

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [40]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)
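
For intuition, here's a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. The constant `c=60` follows the RRF paper linked above; we're assuming, not asserting, that LangChain's implementation matches this sketch in every detail. The two toy rankings are made up.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each doc earns weight / (c + rank) per list (ranks start at 1)."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

toy_bm25_ranking = ["doc_a", "doc_b", "doc_c"]
toy_dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([toy_bm25_ranking, toy_dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Because only ranks are used, never raw scores, RRF can combine retrievers whose scores live on completely different scales (BM25 scores vs. cosine similarities, for example). Documents that several retrievers agree on float to the top.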

We'll pack *all* of these retrievers together in an ensemble.

In [41]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [42]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the context provided from various reviews of John Wick movies, it seems that John Wick received positive reviews overall. The general sentiment is that people enjoyed the movies for their action sequences, Keanu Reeves' performance, the choreography, and the unique world the movies create. Most reviews praise the film for being fun, slick, and filled with stylish action scenes, making it a must-see for action fans. Therefore, it can be concluded that people generally liked John Wick."

In [43]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, one review has a rating of 10. Here is the URL to that review:\n- /review/rw4854296/?ref_=tt_urv'

In [44]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'John Wick is an ex-hitman who comes out of retirement to seek vengeance on gangsters who killed his dog and took everything from him. It follows his journey of revenge as he unleashes destruction against those who crossed him, resulting in intense action, shootouts, and thrilling fights.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

> NOTE: You do not need to run this cell if you're running this locally

In [45]:
#!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example which will:

Calculate the distances between consecutive sentences, and then split the sequence wherever a distance exceeds a given percentile of all distances.

In [46]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)
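
Here's a simplified sketch of percentile-based splitting in plain Python, so the thresholding step is less of a black box. It is not `SemanticChunker`'s actual code: the 2-d toy vectors, the `pct` parameter, and the helper names are assumptions for illustration, and the sketch embeds each sentence on its own.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def percentile(values, pct):
    """Linearly interpolated percentile (pct between 0 and 100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def semantic_chunks(sentences, vectors, pct):
    """Start a new chunk wherever the distance to the next sentence exceeds the percentile."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

toy_sentences = [
    "The action is great.",
    "The stunts are stylish.",
    "Popcorn was stale.",
    "Seats were sticky.",
]
# Hypothetical embeddings: the first two point one way, the last two another
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

toy_chunks = semantic_chunks(toy_sentences, toy_vectors, pct=50)
print(toy_chunks)
```

The only large jump in distance is between the second and third sentences, so that's where the split lands: one chunk about the movie, one about the cinema.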

Now we can split our documents.

In [47]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [48]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [49]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [50]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [51]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, it seems that the majority of people liked John Wick. Some reviews highlighted the movie as cool, fun, and well-done in terms of action sequences and character development. However, there was one review that mentioned the magic being gone in the third installment of the series. So overall, opinions are generally positive, with a few exceptions.'

In [52]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for "John Wick 3". The URL to that review is: \'/review/rw4854296/?ref_=tt_urv\''

In [53]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, Keanu Reeves plays the character of John Wick, a retired assassin who seeks revenge on the people who took something he loved from him. The initial premise of the movie involves someone killing John's dog, which leads to a series of events culminating in a quest for vengeance against those who wronged him. It is a fast-paced action film with stylish stunts and kinetic chaos that keeps viewers engaged throughout."

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
   - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
   - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
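
LangSmith traces are the richer source, but if we just want a rough wall-clock number per chain, a small timing helper is enough. This is a sketch under stated assumptions: `time_calls` is a hypothetical helper, and `toy_chain` (which just sleeps) stands in for something like `lambda q: naive_retrieval_chain.invoke({"question": q})`.

```python
import statistics
import time

def time_calls(fn, inputs, repeats=1):
    """Return per-call wall-clock latencies (in seconds) for fn over inputs."""
    latencies = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            fn(item)
            latencies.append(time.perf_counter() - start)
    return latencies

def toy_chain(question):
    # Stand-in for a real chain invocation
    time.sleep(0.01)
    return "answer"

latencies = time_calls(toy_chain, ["q1", "q2", "q3"])
summary = {"mean_s": statistics.mean(latencies), "max_s": max(latencies)}
print(summary)
```

Running the same question list through each chain gives a like-for-like latency comparison; network jitter means a handful of repeats is usually worth it.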

**My Code Here**

Working on homework 13 and creating a new branch.

In [54]:
### YOUR CODE HERE

In [55]:
import os
from getpass import getpass

In [56]:
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

In [57]:
import ragas
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

# Create the base models
base_llm = ChatOpenAI(model="gpt-4")
base_embeddings = OpenAIEmbeddings()

# Wrap them for ragas
generator_llm = LangchainLLMWrapper(langchain_llm=base_llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings=base_embeddings)


Generating a golden dataset with ragas.

In [58]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(documents, testset_size=10)

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]        Node 88d8010a-5c3a-4fde-8561-96112ce822c7 does not have a summary. Skipping filtering.
Node 7e0bd6d3-439e-475a-96f3-4e0cd038177f does not have a summary. Skipping filtering.
Node e4ca0373-9bcb-4c85-aa66-184fc60f86ef does not have a summary. Skipping filtering.
Node fe43b3d3-abac-4c84-ae9f-9ead14f68951 does not have a summary. Skipping filtering.
Node 48250f8a-ef6a-49d5-bef4-c3687be08c4f does not have a summary. Skipping filtering.
Node 03e4b838-6214-47ca-911d-df5c58c4e1c0 does not have a summary. Skipping filtering.
Node 4f273a6e-258b-4ee7-8dff-344332924350 does not have a summary. Skipping filtering.
Node 7c3a2ef5-67c5-4b64-ab72-7e09f33cb674 does not have a summary. Skipping filtering.
Node 2721cf41-8065-4a5b-bdb5-2f12f9fc063a does not have a summary. Skipping filtering.
Node c3bbdeb4-ad4c-43b2-ae49-c404d4730e02 does not have a summary. Skipping filtering.
Node f4ed672e-93c7-42cc-b298-75cf965c5136 does not have 

In [None]:
dataset.to_pandas()

Testing naive retrieval with ragas.

In [60]:
for test_row in dataset:
  response = naive_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [61]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [62]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

In [92]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig
from ragas.cost import get_token_usage_for_openai

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config,
    token_usage_parser=get_token_usage_for_openai,
)
result

Evaluating: 100%|██████████| 60/60 [06:35<00:00,  6.59s/it]


{'context_recall': 0.8750, 'faithfulness': 0.7710, 'factual_correctness': 0.4390, 'answer_relevancy': 0.8672, 'context_entity_recall': 0.6625, 'noise_sensitivity_relevant': 0.3172}

In [93]:
result.total_tokens()


TokenUsage(input_tokens=367788, output_tokens=81215, model='')

Evaluating BM25 retrieval with Ragas.

In [98]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(documents, testset_size=10)

dataset.to_pandas()

for test_row in dataset:
  response = bm25_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  
from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.cost import get_token_usage_for_openai
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config,
    token_usage_parser=get_token_usage_for_openai,
)
result

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]        Node 5fb08da3-58b3-480f-ad99-54c6516f4358 does not have a summary. Skipping filtering.
Node 82875d90-ef4e-4fa8-be37-a6ece4e9045f does not have a summary. Skipping filtering.
Node 6dc20355-4eb2-4f9b-95f1-4c223aa5817f does not have a summary. Skipping filtering.
Node 8276dfba-aa6c-41fd-887a-d05d830f6183 does not have a summary. Skipping filtering.
Node 03ada8ee-5d2f-4358-9715-d93b761913f4 does not have a summary. Skipping filtering.
Node 42ed666f-f22c-40fa-bde8-0a9f44db76d5 does not have a summary. Skipping filtering.
Node ff5da19c-b015-422c-bc24-feab3a2232bd does not have a summary. Skipping filtering.
Node f5ee8aef-3a3b-4389-8a69-c306024b4ddb does not have a summary. Skipping filtering.
Node 21e1c5c2-55cf-4cd2-a3f9-3e23eaeb900b does not have a summary. Skipping filtering.
Node ed5b3489-4b1c-4f2f-933c-05ab3f3a2c20 does not have a summary. Skipping filtering.
Node 549fadab-ec8b-4292-bc75-620121d7f901 does not have 

{'context_recall': 0.7700, 'faithfulness': 0.6618, 'factual_correctness': 0.4140, 'answer_relevancy': 0.6647, 'context_entity_recall': 0.6861, 'noise_sensitivity_relevant': 0.2708}

In [99]:
result.total_tokens()

TokenUsage(input_tokens=213266, output_tokens=67759, model='')
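
To see where BM25 differs from naive retrieval, the two score dictionaries printed above can be subtracted metric by metric. This is a small sketch with a hypothetical helper, `score_deltas`. Because the BM25 cell regenerated the test set before evaluating, the two runs were scored on different questions, so the differences are indicative rather than a controlled comparison.

```python
# Scores copied from the two evaluate() outputs above.
naive_scores = {
    "context_recall": 0.8750,
    "faithfulness": 0.7710,
    "factual_correctness": 0.4390,
    "answer_relevancy": 0.8672,
    "context_entity_recall": 0.6625,
    "noise_sensitivity_relevant": 0.3172,
}
bm25_scores = {
    "context_recall": 0.7700,
    "faithfulness": 0.6618,
    "factual_correctness": 0.4140,
    "answer_relevancy": 0.6647,
    "context_entity_recall": 0.6861,
    "noise_sensitivity_relevant": 0.2708,
}

def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every metric both runs share."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }

deltas = score_deltas(naive_scores, bm25_scores)

# Print the metrics from largest drop to largest gain.
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:28s} {delta:+.4f}")
```

Note that for `noise_sensitivity_relevant` a lower score is better, so a negative difference there is an improvement.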

Evaluate contextual compression with ragas.


In [104]:
from langsmith.evaluation import LangChainStringEvaluator
# Aliased so it does not shadow the ragas `evaluate` imported earlier
from langsmith.evaluation import evaluate as langsmith_evaluate

evaluator_llm = ChatOpenAI(model="gpt-4")


def prepare_data(run, example):
    # Handle AIMessage output
    if isinstance(run.outputs, dict):
        if "response" in run.outputs:
            prediction = run.outputs["response"].content
        else:
            prediction = str(run.outputs)
    else:
        # Direct AIMessage
        prediction = run.outputs.content
    
    return {
        "prediction": prediction,              # Required by StringEvaluator
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
        # Keep these for qa_evaluator
        "query": example.inputs["question"],
        "result": prediction,
        "answer": example.outputs["answer"]
    }


# Create evaluators with updated prepare_data
qa_evaluator = LangChainStringEvaluator(
    "qa", 
    config={"llm": evaluator_llm},
    prepare_data=prepare_data
)

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": "Is this submission helpful to the user?"
        },
        "llm": evaluator_llm
    },
    prepare_data=prepare_data
)


In [106]:
from ragas.testset import TestsetGenerator

# Create the base models
base_llm = ChatOpenAI(model="gpt-3.5-turbo")
base_embeddings = OpenAIEmbeddings()

# Wrap them for ragas
generator_llm = LangchainLLMWrapper(langchain_llm=base_llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings=base_embeddings)

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(documents, testset_size=10)

dataset.to_pandas()

dataset_contextual = dataset.copy()

Applying SummaryExtractor:   0%|          | 0/44 [00:00<?, ?it/s]unable to apply transformation: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
Applying SummaryExtractor:   2%|▏         | 1/44 [01:22<58:55, 82.22s/it]unable to apply transformation: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
Applying SummaryExtractor:   5%|▍         | 2/44 [01:38<30:13, 43.19s/it]unable to apply transformation: Error code: 429 - {'error': {'message': 'You exceeded your current quot

ValueError: No nodes that satisfied the given filer. Try changing the filter.

In [None]:
# NOTE: this loop still calls naive_retrieval_chain; swap in the contextual
# compression chain built earlier in the notebook to score that retriever.
for test_row in dataset_contextual:
  response = naive_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.cost import get_token_usage_for_openai
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity

# Build the evaluation set from the copy that was just populated
evaluation_dataset = EvaluationDataset.from_pandas(dataset_contextual.to_pandas())

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config,
    token_usage_parser=get_token_usage_for_openai,
)
result

Evaluate multi-query retrieval with ragas.

Evaluate parent document retrieval with ragas.

Evaluate ensemble retrieval with ragas.

Evaluate semantic chunking with ragas.
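
The remaining retrievers can reuse the same "populate the test rows" step that was written out for naive and BM25 retrieval. The sketch below factors that loop into a hypothetical helper, `fill_responses`. It assumes every chain returns a dict with a `"response"` message and a `"context"` list of documents, as the chains in this notebook do. `EchoChain` and the `SimpleNamespace` rows are stand-ins so the example runs without API keys; they are not part of LangChain or ragas.

```python
from types import SimpleNamespace

def fill_responses(chain, test_rows):
    """Invoke `chain` on each row's question and store the answer and contexts in place."""
    for test_row in test_rows:
        sample = test_row.eval_sample
        response = chain.invoke({"question": sample.user_input})
        sample.response = response["response"].content
        sample.retrieved_contexts = [doc.page_content for doc in response["context"]]
    return test_rows

class EchoChain:
    """Hypothetical stub that mimics the output shape of the notebook's RAG chains."""
    def invoke(self, inputs):
        question = inputs["question"]
        return {
            "response": SimpleNamespace(content=f"stub answer to: {question}"),
            "context": [SimpleNamespace(page_content="stub review text")],
        }

# Stand-in for one generated test row (assumption: same attribute names as above).
rows = [
    SimpleNamespace(
        eval_sample=SimpleNamespace(
            user_input="Is John Wick worth watching?",
            response=None,
            retrieved_contexts=None,
        )
    )
]

fill_responses(EchoChain(), rows)
print(rows[0].eval_sample.response)
```

Because the rows are mutated in place, each retriever should be given its own copy of the test set; otherwise a later run overwrites the responses of an earlier one.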

**I also tried evaluating the retrievers with LangSmith, following its instructions for evaluating RAG applications. I used the following code:**

In [64]:
import os
import getpass

In [65]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("LangChain API Key:")
#os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

In [66]:
from langsmith import Client

client = Client(api_key=os.environ["LANGSMITH_API_KEY"])

dataset_name = "John Wick Retrieval v3"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="John Wick Retrieval"
)

In [67]:
for _, data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row["user_input"]
      },
      outputs={
          "answer": data_row["reference"]
      },
      metadata={
          "context": data_row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

Correctness: Response vs Reference Answer

In [83]:
from typing_extensions import Annotated, TypedDict

# Grade output schema
class CorrectnessGrade(TypedDict):
    # Note that the order in which the fields are defined is the order in which the model will generate them.
    # It is useful to put explanations before responses because it forces the model to think through
    # its final response before generating it:
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    correct: Annotated[bool, ..., "True if the answer is correct, False otherwise."]

# Grade prompt
correctness_instructions = """You are a teacher grading a quiz. 

You will be given a QUESTION, the GROUND TRUTH (correct) ANSWER, and the STUDENT ANSWER. 

Here is the grade criteria to follow:
(1) Grade the student answers based ONLY on their factual accuracy relative to the ground truth answer. 
(2) Ensure that the student answer does not contain any conflicting statements.
(3) It is OK if the student answer contains more information than the ground truth answer, as long as it is factually accurate relative to the ground truth answer.

Correctness:
A correctness value of True means that the student's answer meets all of the criteria.
A correctness value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM
grader_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(CorrectnessGrade, method="json_schema", strict=True)

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """An evaluator for RAG answer accuracy"""
    # Get the answer from the response
    if isinstance(outputs["response"], dict):
        answer = outputs["response"].get("content", "")
    else:
        answer = outputs["response"].content
        
    answers = f"""      QUESTION: {inputs['question']}
GROUND TRUTH ANSWER: {reference_outputs['answer']}
STUDENT ANSWER: {answer}"""

    # Run evaluator
    grade = grader_llm.invoke([
        {"role": "system", "content": correctness_instructions}, 
        {"role": "user", "content": answers}
    ])
    return grade["correct"]
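
The response-unpacking branch inside `correctness` can be checked on its own, without calling the grader LLM. The sketch below pulls it into a hypothetical helper, `extract_answer`, and exercises it with stand-in objects. `SimpleNamespace` substitutes for a LangChain message here, and the plain-string branch is an added assumption for chains that end in a string output parser.

```python
from types import SimpleNamespace

def extract_answer(outputs: dict) -> str:
    """Return the answer text from a chain output, whatever shape "response" has."""
    response = outputs["response"]
    if isinstance(response, dict):
        # Serialized message, e.g. {"content": "..."}
        return response.get("content", "")
    if isinstance(response, str):
        # Assumption: the chain already returned plain text
        return response
    # Message-like object exposing a .content attribute
    return response.content

# Three shapes the evaluator might receive.
as_dict = {"response": {"content": "He is a retired hitman."}}
as_message = {"response": SimpleNamespace(content="He is a retired hitman.")}
as_text = {"response": "He is a retired hitman."}

for outputs in (as_dict, as_message, as_text):
    print(extract_answer(outputs))
```

Sharing one helper across the evaluators keeps them consistent with one another when the chain's output shape changes.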

Relevance: Response vs Input

In [84]:
# Grade output schema
class RelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[bool, ..., "Provide the score on whether the answer addresses the question"]

# Grade prompt
relevance_instructions="""You are a teacher grading a quiz. 

You will be given a QUESTION and a STUDENT ANSWER. 

Here is the grade criteria to follow:
(1) Ensure the STUDENT ANSWER is concise and relevant to the QUESTION
(2) Ensure the STUDENT ANSWER helps to answer the QUESTION

Relevance:
A relevance value of True means that the student's answer meets all of the criteria.
A relevance value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM
relevance_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(RelevanceGrade, method="json_schema", strict=True)

# Evaluator
def relevance(inputs: dict, outputs: dict) -> bool:
    """A simple evaluator for RAG answer relevance."""
    # Get the answer from the response
    if isinstance(outputs["response"], dict):
        answer = outputs["response"].get("content", "")
    else:
        answer = outputs["response"].content
    
    # Format the input for grading
    answer_text = f"""      QUESTION: {inputs['question']}
STUDENT ANSWER: {answer}"""

    # Run evaluator
    grade = relevance_llm.invoke([
        {"role": "system", "content": relevance_instructions}, 
        {"role": "user", "content": answer_text}
    ])
    
    return grade["relevant"]

Groundedness: Response vs Documents

In [85]:
# Grade output schema
class GroundedGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    grounded: Annotated[bool, ..., "Provide the score on if the answer hallucinates from the documents"]

# Grade prompt
grounded_instructions = """You are a teacher grading a quiz. 

You will be given FACTS and a STUDENT ANSWER. 

Here is the grade criteria to follow:
(1) Ensure the STUDENT ANSWER is grounded in the FACTS. 
(2) Ensure the STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

Grounded:
A grounded value of True means that the student's answer meets all of the criteria.
A grounded value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM 
grounded_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(GroundedGrade, method="json_schema", strict=True)

# Evaluator
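def groundedness(inputs: dict, outputs: dict) -> bool:
    """A simple evaluator for RAG answer groundedness."""
    # The chain returns the retrieved documents under "context" and the model message under "response"
    doc_string = "\n\n".join(doc.page_content for doc in outputs["context"])
    # Get the answer from the response
    if isinstance(outputs["response"], dict):
        answer_text = outputs["response"].get("content", "")
    else:
        answer_text = outputs["response"].content
    answer = f"""      FACTS: {doc_string}
STUDENT ANSWER: {answer_text}"""
    grade = grounded_llm.invoke([{"role": "system", "content": grounded_instructions}, {"role": "user", "content": answer}])
    return grade["grounded"]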

Retrieval Relevance: Retrieved Docs vs Question

In [86]:
# Grade output schema
class RetrievalRelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[bool, ..., "True if the retrieved documents are relevant to the question, False otherwise"]

# Grade prompt
retrieval_relevance_instructions = """You are a teacher grading a quiz. 

You will be given a QUESTION and a set of FACTS provided by the student. 

Here is the grade criteria to follow:
(1) Your goal is to identify FACTS that are completely unrelated to the QUESTION
(2) If the facts contain ANY keywords or semantic meaning related to the question, consider them relevant
(3) It is OK if the facts have SOME information that is unrelated to the question as long as (2) is met

Relevance:
A relevance value of True means that the FACTS contain ANY keywords or semantic meaning related to the QUESTION and are therefore relevant.
A relevance value of False means that the FACTS are completely unrelated to the QUESTION.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""

# Grader LLM
retrieval_relevance_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(RetrievalRelevanceGrade, method="json_schema", strict=True)

def retrieval_relevance(inputs: dict, outputs: dict) -> bool:
    """An evaluator for document relevance"""
    # Join documents with newline
    doc_string = "\n".join(doc.page_content for doc in outputs["context"])
    
    # Format the input for grading
    answer_text = f"""      FACTS: {doc_string}
QUESTION: {inputs['question']}"""

    # Run evaluator
    grade = retrieval_relevance_llm.invoke([
        {"role": "system", "content": retrieval_relevance_instructions}, 
        {"role": "user", "content": answer_text}
    ])
    
    return grade["relevant"]

In [87]:
from langsmith.evaluation import evaluate as langsmith_evaluate


# First read the dataset (kept separate from the ragas `dataset` variable so it is not overwritten)
langsmith_dataset = client.read_dataset(dataset_name="John Wick Retrieval v3")

# Use langsmith_evaluate with the data parameter
results = langsmith_evaluate(
    naive_retrieval_chain.invoke,
    data="John Wick Retrieval v3",  # Use 'data' parameter with dataset name
    evaluators=[
        correctness, 
        groundedness, 
        relevance, 
        retrieval_relevance
    ],
    metadata={"revision_id": "john_wick_retrieval_naive"}
)

View the evaluation results for experiment: 'long-orange-79' at:
https://smith.langchain.com/o/306dc215-46b3-4252-a039-f86d8a073560/datasets/32d139e2-4bf0-4ca9-bb48-3caae3f80cf3/compare?selectedSessions=7dfcb96f-2994-4326-9f40-030e3b3d37cd




10it [02:06, 12.68s/it]
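
Each of the four evaluators returns a boolean per example, so a run can be summarized as one pass rate per evaluator. The sketch below shows that aggregation with a hypothetical helper, `pass_rates`. The grades in `example_rows` are placeholder values chosen for illustration, not results from the experiment above.

```python
from collections import defaultdict

def pass_rates(rows):
    """Average each boolean evaluator score across all examples."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        for name, passed in row.items():
            totals[name] += int(bool(passed))
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}

# Placeholder grades for three examples; NOT results from this notebook.
example_rows = [
    {"correctness": True, "groundedness": True, "relevance": True, "retrieval_relevance": True},
    {"correctness": False, "groundedness": True, "relevance": True, "retrieval_relevance": True},
    {"correctness": False, "groundedness": False, "relevance": True, "retrieval_relevance": True},
]

rates = pass_rates(example_rows)
for name, rate in rates.items():
    print(f"{name:20s} {rate:.2f}")
```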


Evaluate contextual compression with ragas.

Evaluate multi-query retrieval with ragas.

Evaluate parent document retrieval with ragas.

Evaluate ensemble retrieval with ragas.

Evaluate semantic chunking with ragas.