[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/weaviate-features/services-research/contextual_retrieval.ipynb)

# Contextual Retrieval

Notebook author: Danny Williams @ Weaviate


## Overview

[Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval) is a technique for improving the accuracy of vector search by adding context to each chunk of a document. Both the full document and the chunk are passed to an LLM, which is asked to write a succinct context that situates the chunk within the document.

This combats the lost-context problem that chunking introduces. For example, if a text is split into sentences, later sentences lose the context that earlier sentences provided.

### Example

We'll use the following example to illustrate this. Consider the following text (generated and not from a real company):


In [1]:
text = """
The recent SEC filing provided insights into ACME Corp's performance for Q2 2023. 
It highlighted a 3% revenue growth over the previous quarter. 
The company, which had a revenue of $314 million in the prior quarter, showed steady progress. 
They attributed this growth to strategic initiatives and operational efficiencies. 
The report emphasized the company's resilience and ability to navigate market challenges, reflecting positively on their financial health and future prospects.
""".strip().replace("\n", "")

Splitting this into sentences gives the following chunks:

In [2]:
chunks = text.split(".")
chunks = [chunk.strip() + "." for chunk in chunks if len(chunk) > 0]

print("Raw chunks:")
print("-" * 100)
for chunk in chunks:
    print(chunk)
    print("-" * 100)

Raw chunks:
----------------------------------------------------------------------------------------------------
The recent SEC filing provided insights into ACME Corp's performance for Q2 2023.
----------------------------------------------------------------------------------------------------
It highlighted a 3% revenue growth over the previous quarter.
----------------------------------------------------------------------------------------------------
The company, which had a revenue of $314 million in the prior quarter, showed steady progress.
----------------------------------------------------------------------------------------------------
They attributed this growth to strategic initiatives and operational efficiencies.
----------------------------------------------------------------------------------------------------
The report emphasized the company's resilience and ability to navigate market challenges, reflecting positively on their financial health and future prospects.
-

From the second sentence onwards, no chunk names ACME Corp, even though every sentence refers to this company. Once the text is split, that context is lost.
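To make the lost context concrete, the check below marks which sentence-level chunks actually name the company. It is only a minimal sketch: the `example_chunks` list is copied by hand from the output above, and `names_company` is a hypothetical helper that is not part of the notebook.

```python
# Chunks copied by hand from the "Raw chunks" output above.
example_chunks = [
    "The recent SEC filing provided insights into ACME Corp's performance for Q2 2023.",
    "It highlighted a 3% revenue growth over the previous quarter.",
    "The company, which had a revenue of $314 million in the prior quarter, showed steady progress.",
    "They attributed this growth to strategic initiatives and operational efficiencies.",
    "The report emphasized the company's resilience and ability to navigate market challenges, "
    "reflecting positively on their financial health and future prospects.",
]


def names_company(chunk: str, company: str = "ACME Corp") -> bool:
    """Return True if the chunk mentions the company by name (hypothetical helper)."""
    return company.lower() in chunk.lower()


mentions = [names_company(chunk) for chunk in example_chunks]
for chunk, mentioned in zip(example_chunks, mentions):
    label = "names company" if mentioned else "no company name"
    print(f"{label:<16} | {chunk[:60]}")
```

Only the first chunk passes the check, so a query that mentions ACME Corp has little lexical or semantic overlap with the chunk that actually holds the revenue figure.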

### Comparison Methods

Recent methods have been proposed to alleviate this problem, such as Contextual Retrieval (discussed above) and late chunking. You can read more about late chunking in [this Jina AI post](https://jina.ai/news/late-chunking-in-long-context-embedding-models/) and [this Weaviate blog post](https://weaviate.io/blog/late-chunking), and there is an implementation in Weaviate [here](https://github.com/weaviate/recipes/blob/main/weaviate-features/services-research/late_chunking_berlin.ipynb).


## Setup


First, we install the packages relevant to this notebook.

In [3]:
%%capture
!pip install anthropic sentence_transformers einops pandas tabulate

In [4]:
import warnings
from tqdm import TqdmWarning

warnings.filterwarnings("ignore", category=TqdmWarning)

In [5]:
import anthropic
import sentence_transformers
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
import os

print(f"Using Anthropic: {anthropic.__version__}")
print(f"Using Sentence Transformers: {sentence_transformers.__version__}")
print(f"Using Pandas: {pd.__version__}")
print(f"Using NumPy: {np.__version__}")

Using Anthropic: 0.34.2
Using Sentence Transformers: 3.1.1
Using Pandas: 2.2.3
Using NumPy: 2.1.1


Then we need to set up the LLM API, so we can pass the chunks and document to the LLM.

In this instance we are going to use Anthropic's Claude API with Claude 3 Haiku, as in the original web post for Contextual Retrieval.

To reproduce this notebook, you will need an Anthropic API key. You can get one [here](https://console.anthropic.com/settings/keys). The key must be available in an environment variable called `ANTHROPIC_API_KEY` (for example, defined in a `.env` file that is loaded before the notebook runs).
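Because `os.environ` only sees variables that are already present in the running process, a missing key otherwise surfaces later as an authentication error. The sketch below fails early with a clearer message. `require_env` is a hypothetical helper, and the demo mapping stands in for the real environment so the example runs without a key.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper; `env` defaults to the real process environment.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; define it before running the notebook.")
    return value


# Demo with an explicit mapping (placeholder value, not a real key).
demo_env = {"ANTHROPIC_API_KEY": "sk-example-not-a-real-key"}
api_key_demo = require_env("ANTHROPIC_API_KEY", demo_env)
print("key found:", bool(api_key_demo))
```

In the notebook itself, the call would be `require_env("ANTHROPIC_API_KEY")`, with the result passed to `anthropic.Anthropic(api_key=...)`.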

In [6]:
api_key = os.environ.get("ANTHROPIC_API_KEY")
client = anthropic.Anthropic(api_key=api_key)

We are going to perform semantic search, so we need to set up the embedding model. In this instance, we are using the Jina Embeddings v3 model, a recent model designed for text-matching tasks. We chose it partly because it is lightweight, and partly because it works well with late chunking, which we compare against below.

In [7]:
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
tokenizer = model.tokenizer

## Contextual Retrieval Setup

First, we specify the prompt used to generate the contextual chunks (the same as in the original web post). It simply asks the LLM to provide a succinct context for the chunk within the document.

In [8]:
# contextual chunks
anthropic_prompt = """
<document> 
{{WHOLE_DOCUMENT}} 
</document> 
Here is the chunk we want to situate within the whole document 
<chunk> 
{{CHUNK_CONTENT}} 
</chunk> 
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else. 
""".strip()
anthropic_prompt = anthropic_prompt.replace("{{WHOLE_DOCUMENT}}", text)


Now we create a list of augmented chunks, which includes the context of the document provided by the LLM as well as the chunk itself.

In [9]:
contextual_chunks = []
for chunk in chunks:
    # Fill the placeholder in a fresh string so the template keeps
    # {{CHUNK_CONTENT}} and can be reused for every chunk
    chunk_prompt = anthropic_prompt.replace("{{CHUNK_CONTENT}}", chunk)
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        messages=[
            {"role": "user", "content": chunk_prompt}
        ],
        max_tokens=1000,
    )
    # Prepend the generated context to the original chunk
    contextual_chunks.append(response.content[0].text + " " + chunk)
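The augmentation step can also be exercised offline. The sketch below is an illustration under stated assumptions, not the notebook's code: `TEMPLATE` is a shortened stand-in for the prompt, `build_prompt` and `contextualize` are hypothetical helpers, and `fake_llm` replaces the API call with a fixed string. The point is that each chunk gets its own prompt built from an unmodified template.

```python
from typing import Callable, List

# Shortened stand-in for the notebook's prompt, with the same two placeholders.
TEMPLATE = (
    "<document>\n{{WHOLE_DOCUMENT}}\n</document>\n"
    "<chunk>\n{{CHUNK_CONTENT}}\n</chunk>\n"
    "Give a short context for this chunk."
)


def build_prompt(document: str, chunk: str, template: str = TEMPLATE) -> str:
    """Fill both placeholders and return a new string; the template is left untouched."""
    return template.replace("{{WHOLE_DOCUMENT}}", document).replace("{{CHUNK_CONTENT}}", chunk)


def contextualize(document: str, chunks: List[str], llm: Callable[[str], str]) -> List[str]:
    """Prepend the LLM-provided context to each chunk."""
    return [llm(build_prompt(document, chunk)) + " " + chunk for chunk in chunks]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the API call: always returns the same context."""
    return "From ACME Corp's Q2 2023 SEC filing."


# Toy document and chunks, made up for this sketch.
toy_document = "ACME Corp filed its Q2 2023 report. It highlighted a 3% revenue growth."
toy_chunks = ["ACME Corp filed its Q2 2023 report.", "It highlighted a 3% revenue growth."]
toy_contextual = contextualize(toy_document, toy_chunks, fake_llm)
print(toy_contextual[1])
```

Passing the LLM as a callable keeps the prompt-building logic testable without network access; swapping `fake_llm` for a function that wraps `client.messages.create` would give the real behaviour.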

In [10]:
print("Contextual chunks:")
print("-" * 100)
for chunk in contextual_chunks:
    print(chunk)
    print("-" * 100)

Contextual chunks:
----------------------------------------------------------------------------------------------------
The document provides an overview of ACME Corp's financial performance in Q2 2023 based on their recent SEC filing. The recent SEC filing provided insights into ACME Corp's performance for Q2 2023.
----------------------------------------------------------------------------------------------------
The document provides an overview of ACME Corp's financial performance in Q2 2023 based on the recent SEC filing. It highlighted a 3% revenue growth over the previous quarter.
----------------------------------------------------------------------------------------------------
The document provides an overview of ACME Corp's financial performance in Q2 2023. The company, which had a revenue of $314 million in the prior quarter, showed steady progress.
----------------------------------------------------------------------------------------------------
The document provides an 

## Late Chunking Setup

We are going to compare against late chunking, a method that embeds the whole document at the token level and then averages the token embeddings within each chunk.

First, we obtain the token embeddings for the document (a tensor of shape `num_tokens x embedding_dim`, plus a leading batch dimension). We will later average these over each chunk's tokens to get a single embedding per chunk.

In [11]:
# Ensure the model is on the CPU
model._first_module().auto_model.to('cpu')

tokenized_text = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Move inputs to CPU
inputs = {key: value.to('cpu') for key, value in tokenized_text.items()}

token_embeddings = model._first_module().auto_model(**inputs).last_hidden_state

Now we need to record the positions of the tokens that represent the start and end of each chunk, i.e. the start and end of each sentence denoted by a full stop.

In [12]:
tokens = tokenized_text.tokens()
positions = []
start = 1
for i, token in enumerate(tokens):
    if token == ".":
        positions.append((start, i+1))
        start = i+1
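The span bookkeeping can be checked on a toy token list. This is a sketch under assumptions: `sentence_spans` is a hypothetical function that mirrors the loop above, and `toy_tokens` is invented for illustration (a real tokenizer produces different sub-word pieces).

```python
from typing import List, Tuple


def sentence_spans(tokens: List[str], start: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token spans, one per sentence ending in '.'.

    start=1 skips a leading special token, as the notebook cell does.
    """
    spans = []
    for i, token in enumerate(tokens):
        if token == ".":
            spans.append((start, i + 1))  # end is exclusive, so the '.' is included
            start = i + 1
    return spans


# Hypothetical token list with special tokens at both ends.
toy_tokens = ["<s>", "Revenue", "grew", ".", "Costs", "fell", ".", "</s>"]
toy_spans = sentence_spans(toy_tokens)
print(toy_spans)  # [(1, 4), (4, 7)]
```

The spans are contiguous and exclude both special tokens, so every content token belongs to exactly one chunk.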


Just to verify that we have the correct positions, we will convert the indices back to text and print them, and they should match the earlier chunks (up to some tokenization differences).

In [13]:
# Convert indices back to text using tokenizer
converted_texts = []
for start, end in positions:
    token_ids = tokenized_text['input_ids'][0][start:end]
    # Use separate names so the document `text` and the full `tokens` list are not overwritten
    chunk_tokens = tokenizer.convert_ids_to_tokens(token_ids)
    chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
    converted_texts.append(chunk_text)

print("Converted texts:")
print("-" * 100)
for converted_text in converted_texts:
    print(converted_text)
    print("-" * 100)


Converted texts:
----------------------------------------------------------------------------------------------------
The recent SEC filing provided insights into ACME Corp's performance for Q2 2023.
----------------------------------------------------------------------------------------------------
It highlighted a 3% revenue growth over the previous quarter.
----------------------------------------------------------------------------------------------------
The company, which had a revenue of $314 million in the prior quarter, showed steady progress.
----------------------------------------------------------------------------------------------------
They attributed this growth to strategic initiatives and operational efficiencies.
----------------------------------------------------------------------------------------------------
The report emphasized the company's resilience and ability to navigate market challenges, reflecting positively on their financial health and future prospec

## Embeddings

Now that we have the contextually augmented chunks and the token embeddings for late chunking, we can compute one embedding per chunk for each method: by pooling the token embeddings over each chunk's span (late chunking), or by encoding the contextual chunk directly (contextual retrieval). As a baseline, we also compute naive chunking embeddings by encoding the raw chunks directly. This is the standard approach, i.e. encoding the text without any added context.

In [14]:
# late chunking embeddings
late_chunk_embeddings = []
for start, end in positions:
    late_chunk_embeddings.append(token_embeddings.squeeze()[start:end, :].mean(0).float().detach().numpy())
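The pooling step itself is just a per-span average. Below is a minimal pure-Python sketch with made-up 2-dimensional vectors; `mean_pool` is a hypothetical helper, whereas the notebook does the same thing on the model's tensor with `.mean(0)`.

```python
from typing import List, Sequence, Tuple


def mean_pool(
    token_vectors: Sequence[Sequence[float]], spans: List[Tuple[int, int]]
) -> List[List[float]]:
    """Average the token vectors inside each half-open (start, end) span."""
    pooled = []
    for start, end in spans:
        rows = token_vectors[start:end]
        dim = len(rows[0])
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled


# Made-up token vectors; index 0 plays the role of a skipped special token.
toy_vectors = [
    [0.0, 0.0],
    [1.0, 3.0],
    [3.0, 5.0],
    [2.0, 2.0],
    [4.0, 0.0],
]
toy_pooled = mean_pool(toy_vectors, [(1, 3), (3, 5)])
print(toy_pooled)  # [[2.0, 4.0], [3.0, 1.0]]
```

What distinguishes late chunking from naive chunking is not the averaging but its input: the token vectors come from a single forward pass over the whole document, so each one already reflects the surrounding sentences.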

In [15]:
# contextual chunking embeddings
contextual_chunk_embeddings = []
for chunk in contextual_chunks:
    contextual_chunk_embeddings.append(model.encode(chunk))

In [16]:
# Naive chunking embeddings
naive_chunk_embeddings = []
for chunk in chunks:
    naive_chunk_embeddings.append(model.encode(chunk))

## Comparison

We will query across these embeddings and compare the results. To compare embeddings we use cosine similarity, which the next cell defines.

In [17]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)
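As a quick sanity check on the definition, cosine similarity depends only on the direction of the vectors, not on their length. The sketch below uses a pure-Python equivalent (`cosine` is a hypothetical name, so it does not shadow the notebook's function) on small hand-picked vectors.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors, different lengths
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine([1.0, 1.0], [-1.0, -1.0])      # opposite directions

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Parallel vectors score 1, perpendicular vectors score 0, and opposite vectors score -1 (up to floating-point rounding), which is why a higher score is read as a closer semantic match.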

Now for a given query, we want to see how similar the query is to each chunk using cosine similarity.

In [18]:
query = "What is ACME Corp's revenue growth for Q2 2023?"
query_embedding = model.encode(query)


The next cell calculates these cosine similarities for each method.

In [19]:
late_chunking_similarities = [cosine_similarity(query_embedding, chunk_embedding) for chunk_embedding in late_chunk_embeddings]
contextual_chunking_similarities = [cosine_similarity(query_embedding, chunk_embedding) for chunk_embedding in contextual_chunk_embeddings]
naive_chunking_similarities = [cosine_similarity(query_embedding, chunk_embedding) for chunk_embedding in naive_chunk_embeddings]


The next cell displays these similarities (as a markdown table) for each method. Since our query asks about the revenue growth of ACME Corp, the score that matters most is the one for the chunk containing this information, which in this case is the second chunk.

In [20]:
import pandas as pd

# Create a DataFrame to store the results
results = {
    "Chunk": chunks,
    "Late Chunking Similarity": late_chunking_similarities,
    "Contextual Retrieval Similarity": contextual_chunking_similarities,
    "Naive Chunking Similarity": naive_chunking_similarities
}

df_results = pd.DataFrame(results)

# Function to format the similarities column-wise
def format_similarities_column_wise(df, column_name):
    max_sim = df[column_name].max()
    second_max_sim = df[column_name][df[column_name] != max_sim].max()
    
    df[column_name] = df[column_name].apply(lambda x: f"**{x:.4f}**" if x == max_sim else f"*{x:.4f}*" if x == second_max_sim else f"{x:.4f}")
    return df

# Apply the formatting function to each similarity column
df_results = format_similarities_column_wise(df_results, "Late Chunking Similarity")
df_results = format_similarities_column_wise(df_results, "Contextual Retrieval Similarity")
df_results = format_similarities_column_wise(df_results, "Naive Chunking Similarity")

# Make the second chunk bold
df_results.iloc[1, 0] = f"**{df_results.iloc[1, 0].strip()}**"

# Display the DataFrame as a table
from IPython.display import display
from IPython.display import Markdown

def df_to_markdown(df):
    markdown_str = df.to_markdown(index=False)
    return markdown_str

markdown_results = df_to_markdown(df_results)
display(Markdown(markdown_results))


| Chunk                                                                                                                                                           | Late Chunking Similarity   | Contextual Retrieval Similarity   | Naive Chunking Similarity   |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------|:----------------------------------|:----------------------------|
| The recent SEC filing provided insights into ACME Corp's performance for Q2 2023.                                                                               | 0.8305                     | 0.8069                            | **0.8505**                  |
| **It highlighted a 3% revenue growth over the previous quarter.**                                                                                               | **0.8516**                 | **0.8590**                        | *0.6343*                    |
| The company, which had a revenue of $314 million in the prior quarter, showed steady progress.                                                                  | *0.8424*                   | *0.8546*                          | 0.6169                      |
| They attributed this growth to strategic initiatives and operational efficiencies.                                                                              | 0.7997                     | 0.8234                            | 0.5191                      |
| The report emphasized the company's resilience and ability to navigate market challenges, reflecting positively on their financial health and future prospects. | 0.8022                     | 0.8061                            | 0.6007                      |

The table highlights, in bold, the chunk that contains the information that the query is looking for.

The similarities in bold are the highest for each method, and the similarities in italics are the second highest.

As expected, both contextual retrieval and late chunking assign their highest similarity score to the second chunk, correctly identifying the chunk that contains this information.

Naive chunking, on the other hand, does not identify the correct chunk, because the second sentence on its own has lost the context supplied by the first. Instead, it ranks the first chunk highest, since that chunk shares more words with the query (such as "ACME Corp" and "Q2 2023") even though it does not contain the answer.
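The same conclusion can be read off programmatically by taking the top-scoring chunk per method. The sketch below copies the scores from the table above by hand; `top_index` is a hypothetical helper, and `target_index = 1` encodes the assumption stated earlier that the second chunk holds the answer.

```python
from typing import Dict, List

# Similarity scores copied from the results table (one list per method, in chunk order).
scores: Dict[str, List[float]] = {
    "Late Chunking": [0.8305, 0.8516, 0.8424, 0.7997, 0.8022],
    "Contextual Retrieval": [0.8069, 0.8590, 0.8546, 0.8234, 0.8061],
    "Naive Chunking": [0.8505, 0.6343, 0.6169, 0.5191, 0.6007],
}
target_index = 1  # the second chunk contains the 3% revenue growth figure


def top_index(values: List[float]) -> int:
    """Return the position of the highest score."""
    return max(range(len(values)), key=values.__getitem__)


top_hits = {method: top_index(values) for method, values in scores.items()}
for method, idx in top_hits.items():
    verdict = "retrieves the target chunk" if idx == target_index else "misses the target chunk"
    print(f"{method}: top chunk index {idx} ({verdict})")
```

With these numbers, late chunking and contextual retrieval both return index 1, while naive chunking returns index 0.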