# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

In [None]:
# !pip install -qU langchain langchain-openai langchain-cohere rank_bm25

We're also going to be leveraging [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode (so we can leverage it locally in our colab environment).

In [None]:
# !pip install -qU qdrant-client

> ## This set of pinned versions works for most of the notebook
>NOTE: Not compatible with Cohere, so that step fails
>-------

In [1]:
!pip install -qU langchain==0.2.16 langchain-cohere==0.3.0 langchain-community==0.2.17 langchain-core==0.2.41 
!pip install -qU langchain-experimental==0.3.2 langchain-huggingface==0.1.0 langchain-openai==0.1.25 langchain-qdrant==0.1.3
!pip install -qU langchain-text-splitters==0.2.4 langgraph==0.2.16 langgraph-checkpoint==1.0.6 langsmith==0.1.129 ragas==0.1.20

[31mERROR: Cannot install langchain-cohere==0.3.0, langchain-core==0.2.41 and langchain==0.2.16 because these package versions have conflicting dependencies.[0m[31m
[0m[31mERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[31mERROR: Cannot install langchain-experimental==0.3.2, langchain-huggingface==0.1.0 and langchain-openai==0.1.25 because these package versions have conflicting dependencies.[0m[31m
[0m[31mERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip i

In [2]:
!pip install -qU rank_bm25
!pip install -qU qdrant-client


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


We'll also provide our OpenAI key, as well as our Cohere API key.

In [3]:
import os
import getpass


In [4]:

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [5]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

In [6]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your Langsmith API key here: ")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [7]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O ./data/john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O ./data/john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O ./data/john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O ./data/john_wick_4.csv

--2024-10-01 13:51:56--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘./data/john_wick_1.csv’


2024-10-01 13:51:57 (8.80 MB/s) - ‘./data/john_wick_1.csv’ saved [19628/19628]

--2024-10-01 13:51:57--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘./data/john_wick_2.csv’


### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [8]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"./data/john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  print(f'number of reviews for john_wick_{i} is: {len(movie_docs)} ')
  documents.extend(movie_docs)

number of reviews for john_wick_1 is: 25 
number of reviews for john_wick_2 is: 25 
number of reviews for john_wick_3 is: 25 
number of reviews for john_wick_4 is: 25 


In [8]:
# documents

Let's look at an example document to see if everything worked as expected!

In [9]:
documents[0]

Document(metadata={'source': './data/john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2024, 9, 28, 13, 52, 2, 414598)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

In [10]:
len(documents)

100

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [11]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [12]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [13]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-3.5-turbo` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [14]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI()

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [15]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [16]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Yes, people generally liked John Wick. The reviews praised the film for its slickness, brilliant action sequences, and Keanu Reeves' performance as the titular character."

In [17]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\'.'

In [18]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman comes out of retirement to seek vengeance against gangsters who killed his dog and took everything from him. The story is filled with violent action, shootouts, and breathtaking fights as John Wick unleashes destruction against those who wronged him.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [19]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)

We'll construct the same chain - only changing the retriever.

In [20]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [21]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People's opinions on John Wick vary. Some reviewers really enjoyed the action sequences and Keanu Reeves' performance, while others found the film lacking in plot and substance. So, it's safe to say that not everyone liked John Wick."

In [22]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"I'm sorry, there are no reviews with a rating of 10 in the provided context."

In [23]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the movie "John Wick," the main character, portrayed by Keanu Reeves, is a retired hitman seeking vengeance for the death of his dog, which was a final gift from his deceased wife. This sets off a series of intense action sequences as John Wick goes after those responsible, showcasing his exceptional skills in combat and gunfights. The movie is praised for its well-choreographed action scenes and emotional depth within the action genre.'

It's not clear that this is better or worse - but the `I don't know` isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

## Cohere dependency on Langchain in Conflict with RAGAS Dependency - This notebook does not use any Cohere Reranking; Instead I use Langchain Compression Algo Using OpenAI as an LLM

In [24]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

RuntimeError: no validator found for <class 'pydantic.types.SecretStr'>, see `arbitrary_types_allowed` in Config

Let's create our chain again, and see how this does!

In [32]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [25]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [27]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [28]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the positive reviews provided.'

In [29]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\'.'

In [30]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick: Chapter 2, the main character, John Wick, is forced back into the world of assassins when an Italian baddie calls in a favor and Wick has no choice but to accept. He is then tasked with killing the Italian baddie's sister in Rome, which leads to a contract being placed on him, attracting professional killers from everywhere. Wick promises to kill the Italian baddie, who is no longer protected by his marker."

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [31]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [32]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

  parent_document_vectorstore = Qdrant(


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [33]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [34]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [35]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [36]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'It seems that opinions on John Wick vary. Some people like the series, while others do not.'

In [37]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". The URL to that review is: \'/review/rw4854296/?ref_=tt_urv\'.'

In [38]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick 2, John Wick is called on to pay off an old debt by helping Ian McShane take over the Assassin's Guild by flying around to Italy, Canada, and Manhattan and killing what seems like hundreds of assassins."

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [39]:
from langchain.retrievers import EnsembleRetriever

# retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [40]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [41]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the positive reviews and high ratings it received.'

In [42]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\''

In [43]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, an ex-hit-man comes out of retirement to seek vengeance against the gangsters that killed his dog and took everything from him. This leads to a series of intense shootouts, action-packed sequences, and a relentless pursuit for revenge. The story revolves around John Wick's journey as he unleashes destruction against those who try to harm him, ultimately leading to a showdown with various adversaries in the criminal underworld."

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

In [None]:
# !pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [44]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [45]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [46]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [47]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [48]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [49]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided.'

In [50]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\'.'

In [51]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the movie "John Wick," the protagonist seeks revenge on the people who killed his dog and stole his car. It leads to a lot of action and chaos as John Wick unleashes his skills as a legendary hitman to exact vengeance.'

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against eachother.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

In [None]:
### YOUR CODE HERE

# MY CODE HERE

## Steps

1.  Use RAGAS to create a synthetic dataset of questions and answers.

2.  For each retriever, construct a `semantic-chunking-on` and `semantic-chunking-off` retrieval pipeline.

3.  Set up LangSmith to track and capture metrics about latency and cost.

4.  Run each pipeline and prepare a table of results.

5.  Summarize in a short paragraph which is best for this particular data and why.

## Synthetically Generate Test Questions Using the RAGAS Pipeline

> NOTE
> -----
>
> ### Because of dependency issues, this section is run once and the RAGAS questions and answers are saved in a csv file and used to evaluate all the retriever pipelines

In [None]:
# !pip install -U -q langchain langchain-openai langchain_core==0.2.38 langchain-text-splitters


# import os
# import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your Langsmith API key here: ")
# os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API key here: ")

### For convenience I also saved a version of the load_movie_reviews as a function to be called 

In [52]:
from myutils.load_movie_reviews import load_movie_reviews

In [53]:
documents = load_movie_reviews()

number of reviews for john_wick_1 is: 25 
number of reviews for john_wick_2 is: 25 
number of reviews for john_wick_3 is: 25 
number of reviews for john_wick_4 is: 25 


In [54]:
import pandas as pd

from ragas.metrics import faithfulness, answer_relevancy, answer_correctness, context_recall, context_precision
from ragas.testset.evolutions import simple, reasoning, multi_context

from myutils.ragas_pipeline import RagasPipeline

  from .autonotebook import tqdm as notebook_tqdm


#### Set Up RAGAS Pipeline Parameters

In [55]:
# LLM models used in RAGAS pipeline
ragas_generator_llm_model = 'gpt-3.5-turbo'
ragas_critic_llm_model = 'gpt-4o-mini'

# embeddings used for RAGAS pipeline
ragas_openai_embeddings_model = 'text-embedding-3-small'

# text splitter params
ragas_chunk_size = 500
ragas_chunk_overlap = 200

# number of qa pairs needed - reduce if running into rate limit issues
ragas_number_of_qa_pairs = 30

# initialize distributions - desired distribution of question types
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

# name of file to persist RAGAS Q&A on disk
ragas_testset_filename = "./data/ragas_questions_and_answers.csv"

In [56]:
# FLAG TO INDICATE IF RAGAS TESTSET SHOULD BE GENERATED IN THIS RUN
# IF it is run, note the cost and time estimate below!!!
generate_ragas_testset_now = False

In [57]:
# set up list of RAGAS metrics used below
ragas_metrics = [
    context_precision,
    context_recall
]

#### Instantiate RAGAS Pipeline, Run Pipeline, Generate Test Questions


In [58]:
# NOTE - this cell will incur significant cost due to SDG's use of OpenAI models
# Time taken on my local machine: ~ 15 mins

ragas_pipeline = RagasPipeline(
        generator_llm_model=ragas_generator_llm_model,
        critic_llm_model=ragas_critic_llm_model,
        embedding_model=ragas_openai_embeddings_model,
        number_of_qa_pairs=ragas_number_of_qa_pairs,
        chunk_size=ragas_chunk_size,
        chunk_overlap=ragas_chunk_overlap,
        documents=documents,
        distributions=distributions
)

In [59]:

if generate_ragas_testset_now is True:
    ragas_testset_df = ragas_pipeline.generate_testset()
    ragas_testset_df.to_csv(ragas_testset_filename)
else:
    pass

#### Load RAGAS Q&A from disk

In [60]:
ragas_test_df = pd.read_csv(ragas_testset_filename)
ragas_test_questions = ragas_test_df["question"].values.tolist()
ragas_test_groundtruths = ragas_test_df["ground_truth"].values.tolist()

## Retrievers and Retrieval Chains

List of Retrievers
------------------

1.  Naive Retriever: 
2.  BM25 Retriever
3.  `Contextual Compression Using Cohere Reranking`
-   NOTE the code is written but not run due to severe dependency conflicts with Langchain version needed for this and for RAGAS.
3.  `Compression using Langchain compresser and OpenAI LLM`
-   NOTE: Implemented in lieu of the one using cohere reranking.
4.  Multi-query Retriever
5.  Parent-document Retriever
6.  Ensemble Retriever

Semantic Chunking
-----------------
Each of the above retrievers relies on chunked documents.  

I will form chunks using two approaches

-   simple chunking (using each review as a chunk)
-   semantic chunking (using the semanticChunker to split into semantically related chunks)

> NOTE on REFACTORED CODE
> -------
>
> I have refactored the code in the first half of the notebook.  All the code to create prompts, retrievers and retrieval chains is contained in a file called `retrievers.py` in the myutils folder in this repo.
> 
> I have kept the original code in the first half of the notebook intact for reference, but plan to use my refactored code in the exercise below

In [None]:
# !pip install -qU langchain langchain-openai langchain-cohere rank_bm25
# !pip install -qU qdrant-client
# !pip install -qU langchain_experimental


# import os
# import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
# os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")
# os.environ["LANGSMITH_API_KEY"]

### Load movie reviews into Documents

### Note - these will be the FIRST SET of the chunks used for the RAG pipeines

In [61]:
from myutils.load_movie_reviews import load_movie_reviews
documents = load_movie_reviews()

number of reviews for john_wick_1 is: 25 
number of reviews for john_wick_2 is: 25 
number of reviews for john_wick_3 is: 25 
number of reviews for john_wick_4 is: 25 


### Get Semantically Chunked Documents

### NOTE - these will be the SECOND SET of chunks used for the RAG pipeline

In [62]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [63]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="interquartile"
)

semantic_documents = semantic_chunker.split_documents(documents)

### Set Up Chat Model for RAG Pipelines

In [64]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

In [65]:
rag_chain_chat_model = ChatOpenAI()

### Set Up Prompt Template and Prompt for RAG Pipelines

In [66]:
RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)


## A Whole Set of Helper Functions

### Helper Function to Get Retrieval Chain After Passing in The Retriever

In [67]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser


def get_retrieval_chain(retriever):
    retrieval_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | {"response": rag_prompt | rag_chain_chat_model, "context": itemgetter("context")}
    )
    return retrieval_chain

### Helper Function To Get Answers to RAGAS-Generated Questions

In [68]:
from datasets import Dataset


def get_responses_to_ragas_questions(retrieval_chain, ragas_questions, ragas_groundtruths):
    """
    Helper function that runs a retrieval chain to generate 
    responses to RAGAS generated questions
    """

    # run RAG pipeline on RAGAS synthetic questions
    answers = []
    contexts = []

    for question in ragas_questions:
        response = retrieval_chain.invoke({"question" : question})
        answers.append(response["response"].content)
        contexts.append([context.page_content for context in response["context"]])

    # Save RAG pipeline results to HF Dataset object
    response_dataset = Dataset.from_dict({
        "question" : ragas_questions,
        "answer" : answers,
        "contexts" : contexts,
        "ground_truth" : ragas_groundtruths
    })

    return response_dataset

### Helper Function to Compute RAGAS Metrics

In [69]:
from ragas import evaluate


def evaluate_rag_pipeline_using_ragas_metrics(response_dataset, ragas_metrics):
    """
    Helper function that takes in responses to RAGAS-questions and evaluates
    performance of the RAG pipeline using ragas metrics
    """

    # Run RAGAS Evaluation - using metrics
    results = evaluate(response_dataset, ragas_metrics)

    # save results to df
    results_df = results.to_pandas()

    return results, results_df

## Set Up All the Retrieval Chains

NOTE #1
----

1.  I will evaluate a total of 12 retrieval chains

2.  For the simple text splitter, i.e., where each review is its own chunk, I will have six retrieval chains.  These are the `naive retriever`, the `bm25 retriever`, the `contextual compression retriever`, the `multi_query retriever`, the `parent document retriever` and the `ensemble retriever`.

3.  For the case when documents are semantically chunked, I have the same six retrieval chains.

NOTE #2
-----

1.  I have coded up the Cohere ReRanker based chain for contextual compression retrieval.  However, there are severe incompabilities between the versions of Langchain needed for this chain to work and for the RAGAS evaluation pipeline to work.  So, you will see the code commented out.

2.  Instead, I am implementing a Langchain-equivalent compression retriever that uses an OpenAI LLM to compress the contexts retrieved.  You will see this labeled clearly below.

In [70]:
import myutils.retrievers
from myutils.retrievers import Retrievers

### Set up retrieval chains for all the non-semantic cases

In [71]:
non_semantic_retriever_set_up = \
    Retrievers(
        documents=documents, 
        collection_name="JohnWick", 
        pd_collection_name="jw_full_documents"
    )

naive_retriever = non_semantic_retriever_set_up.get_retriever("naive_retriever")
naive_retrieval_chain = get_retrieval_chain(naive_retriever)

bm25_retriever = non_semantic_retriever_set_up.get_retriever("bm25_retriever")
bm25_retrieval_chain = get_retrieval_chain(bm25_retriever)

# cohere_contextual_compression_retriever = non_semantic_retriever_set_up.get_retriever("cohere_contextual_compression_retriever")
# cohere_contextual_compression_retrieval_chain = get_retrieval_chain(cohere_contextual_compression_retriever)

langchain_compression_retriever = non_semantic_retriever_set_up.get_retriever("langchain_compression_retriever")
langchain_compression_retrieval_chain = get_retrieval_chain(langchain_compression_retriever)

multi_query_retriever = non_semantic_retriever_set_up.get_retriever("multi_query_retriever")
multi_query_retrieval_chain = get_retrieval_chain(multi_query_retriever)

parent_document_retriever = non_semantic_retriever_set_up.get_retriever("parent_document_retriever")
parent_document_retrieval_chain = get_retrieval_chain(parent_document_retriever)

ensemble_retriever = non_semantic_retriever_set_up.get_retriever("ensemble_retriever")
ensemble_retrieval_chain = get_retrieval_chain(ensemble_retriever)

### Set up retrieval chains for all the semantic cases

In [72]:
semantic_retriever_set_up = \
    Retrievers(
        documents=documents, 
        collection_name="JohnWickSemantic", 
        pd_collection_name="jwsem_full_documents"
    )

naive_retriever = semantic_retriever_set_up.get_retriever("naive_retriever")
naive_retrieval_chain_sem = get_retrieval_chain(naive_retriever)

bm25_retriever = semantic_retriever_set_up.get_retriever("bm25_retriever")
bm25_retrieval_chain_sem = get_retrieval_chain(bm25_retriever)

# cohere_contextual_compression_retriever = semantic_retriever_set_up.get_retriever("cohere_contextual_compression_retriever")
# cohere_contextual_compression_retrieval_chain_sem = get_retrieval_chain(cohere_contextual_compression_retriever)

langchain_compression_retriever = semantic_retriever_set_up.get_retriever("langchain_compression_retriever")
langchain_compression_retrieval_chain_sem = get_retrieval_chain(langchain_compression_retriever)

multi_query_retriever = semantic_retriever_set_up.get_retriever("multi_query_retriever")
multi_query_retrieval_chain_sem = get_retrieval_chain(multi_query_retriever)

parent_document_retriever = semantic_retriever_set_up.get_retriever("parent_document_retriever")
parent_document_retrieval_chain_sem = get_retrieval_chain(parent_document_retriever)

ensemble_retriever = semantic_retriever_set_up.get_retriever("ensemble_retriever")
ensemble_retrieval_chain_sem = get_retrieval_chain(ensemble_retriever)

## Set Up LangSmith Function to Trace LLM Calls And Get Latency, Cost, Etc.

In [73]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"vc Advanced Retrieval Pipelines week7 - {unique_id}"

In [74]:
print(f'project name on Langsmith will be: vc Advanced Retrieval Pipelines week7 - {unique_id} ')

project name on Langsmith will be: vc Advanced Retrieval Pipelines week7 - a6e6f867 


In [75]:
from langsmith import traceable

@traceable(
    run_type="llm",
    name="OpenAI Call Decorator",
    project_name=f"vc Advanced Retrieval Pipelines week7 - {unique_id}"
)
def run_langsmith_eval(retrieval_chain, ragas_questions, ragas_groundtruths):
    response_dataset = get_responses_to_ragas_questions(retrieval_chain, ragas_questions, ragas_groundtruths)
    return response_dataset


## Get Responses to RAGAS Questions - with LangSmith Tracing

### Helper Function To Merge a list of dataframes

In [76]:
def merge_dataframes(list_of_dataframes):
    """
    helper function to merge a list of several dataframes
    """
    final_df = list_of_dataframes[0]
    for df in list_of_dataframes[1:]:
        final_df = pd.merge(final_df, df, on="Metric")
    return final_df    

### Helper Functions to Run RAGAS Questions Using Chains and Collate Results

In [77]:
def ragas_eval_with_langsmith_one_chain(retrieval_chain, chain_type):
    ds = run_langsmith_eval(retrieval_chain, ragas_test_questions, ragas_test_groundtruths)
    measures, _ = evaluate_rag_pipeline_using_ragas_metrics(ds, ragas_metrics)
    df = pd.DataFrame(list(measures.items()), columns=["Metric", chain_type])
    return df

In [78]:
def run_all_chains(list_of_retrieval_chains):
    all_dfs = dict()
    for chain_object in list_of_retrieval_chains:
        chain_type, retrieval_chain = list(chain_object.keys())[0], list(chain_object.values())[0]
        df = ragas_eval_with_langsmith_one_chain(retrieval_chain, chain_type)
        all_dfs[chain_type] = df
    
    list_of_dfs = list(all_dfs.values())
    final_df = merge_dataframes(list_of_dfs)

    return final_df

## Run RAGAS/LangSmith For Non-Semantic Chains

In [79]:
list_of_non_semantic_retrieval_chains = [
    {'naive': naive_retrieval_chain},
    {'bm25': bm25_retrieval_chain},
    {'lccompression': langchain_compression_retrieval_chain},
    {'mqchain': multi_query_retrieval_chain},
    {'pd': parent_document_retrieval_chain},
    {'ensemble': ensemble_retrieval_chain}
]

results_all_non_semantic_chains_df = run_all_chains(list_of_non_semantic_retrieval_chains)

Evaluating: 100%|██████████| 48/48 [00:29<00:00,  1.65it/s]
Evaluating: 100%|██████████| 48/48 [00:15<00:00,  3.11it/s]
Evaluating: 100%|██████████| 48/48 [00:26<00:00,  1.78it/s]
Evaluating: 100%|██████████| 48/48 [00:40<00:00,  1.19it/s]
Evaluating: 100%|██████████| 48/48 [00:06<00:00,  6.89it/s]
Evaluating: 100%|██████████| 48/48 [01:01<00:00,  1.27s/it]


## Create CSV with Latency and Cost and Load Data for Non-Semantic Chains

#### At this point, I copied into a CSV file the cost and latency results from LangSmith for this set of runs

In [80]:
cost_latency_non_sem_df = pd.read_csv('./data/non_semantic_chain_cost_latency.csv')


## Merge Cost/Latency with RAGAS Evaluation Results: Non-Semantic Chains

In [81]:
combined_non_semantic_df = pd.concat([results_all_non_semantic_chains_df, cost_latency_non_sem_df],
                                     ignore_index=True)

## Run RAGAS/LangSmith For Semantic Chains

In [82]:
list_of_semantic_retrieval_chains = [
    {'naive': naive_retrieval_chain_sem},
    {'bm25': bm25_retrieval_chain_sem},
    {'lccompression': langchain_compression_retrieval_chain_sem},
    {'mqchain': multi_query_retrieval_chain_sem},
    {'pd': parent_document_retrieval_chain_sem},
    {'ensemble': ensemble_retrieval_chain_sem}
]

results_all_semantic_chains_df = run_all_chains(list_of_semantic_retrieval_chains)

Evaluating: 100%|██████████| 48/48 [00:27<00:00,  1.73it/s]
Evaluating: 100%|██████████| 48/48 [00:11<00:00,  4.07it/s]
Evaluating: 100%|██████████| 48/48 [00:24<00:00,  1.97it/s]
Evaluating: 100%|██████████| 48/48 [00:37<00:00,  1.29it/s]
Evaluating: 100%|██████████| 48/48 [00:07<00:00,  6.80it/s]
Evaluating: 100%|██████████| 48/48 [01:06<00:00,  1.39s/it]


## Create CSV with Latency and Cost and Load Data for Semantic Chains

In [83]:
cost_latency_sem_df = pd.read_csv('./data/semantic_chain_cost_latency.csv')

## Merge Cost/Latency with RAGAS Evaluation Results: Semantic Chains

#### At this point, I copied into a CSV file the cost and latency results from LangSmith for this set of runs

In [84]:
combined_semantic_df = pd.concat([results_all_semantic_chains_df, cost_latency_sem_df],
                                 ignore_index=True)

## Compare Results

### Non-semantic chunking chains

In [85]:
combined_non_semantic_df

Unnamed: 0,Metric,naive,bm25,lccompression,mqchain,pd,ensemble
0,context_precision,0.748184,0.475694,0.538775,0.657331,0.791667,0.593284
1,context_recall,0.923611,0.673611,0.791667,0.923611,0.861111,0.923611
2,Cost($),0.045234,0.017461,0.171966,0.061756,0.009049,0.239025
3,Latency(sec),39.32,30.68,333.42,87.39,33.41,396.43


### Semantic Chunking Chains

In [86]:
combined_semantic_df

Unnamed: 0,Metric,naive,bm25,lccompression,mqchain,pd,ensemble
0,context_precision,0.719821,0.486111,0.57738,0.670004,0.770833,0.623082
1,context_recall,0.965278,0.673611,0.791667,0.965278,0.861111,0.944444
2,Cost($),0.045293,0.017411,0.171113,0.059637,0.009004,0.23772
3,Latency(sec),42.55,27.81,316.23,89.61,33.79,402.46


## Discussion of Results

1.  Surprisingly, `semantic chunking` did not move the needle very much across any of the retrieval methods.  Sure, there are some improvements, e.g., with `multi-query` but in many other cases, the comparisons were mixed.

2.  As expected, the latency and costs were the lowest for `bm25` as it places the least demands on the use of embeddings for retrieval.  On the other side of the spectrum, `compression-retriever` (as least the one I implemented, which is the LangChain version that uses an OpenAI model) and the `ensemble-retriever` seem to be the most expensie both in terms of cost and latency.  Unsurprisingly, `ensemble-retriever`, which uses all the other methods, has the highest cost and latency, but compression retrieval is very close, suggesting this method is the real bottleneck.

3.  There is no single method that dominates others on the basis of recall, precision and cost/latency.  However, the `naive retriever` comes really close.  Of all the alternatives, it has the highest `recall`, the second-highest `precision` and ranks pretty well on cost and latency.

4.  In my opinion, the `multi-query retriever` seems the most sound method based on intuition.  It does favorably well, coming in with fairly decent recall and precision scores, very comparable to the `naive retriever`.  However, it does suffer on the latency dimension as it does reformulate the query a few times.

## What I would pick for this dataset

#### I would be inclined to lean in favor of `multi-query retriever` as it has sound intuition and does quite well on precision, recall while also doing tolerably well on cost/latency.