# Reranking

<img src="./media/reranking.png" width=600>

[Image: Rerankers and Two-Stage Retrieval](https://www.pinecone.io/learn/series/rag/rerankers/)

As semantic similarity becomes a core technique for delivering context to LLM-based applications, the challenge of **finding truly relevant information** grows more important. Most modern systems use **embedding models** to convert unstructured text into vector representations, storing these in a vector database for fast similarity-based retrieval.

While this first-step retrieval process is efficient and scalable, **the top results may not always be the best-aligned passages for a given query.** They might be “near matches,” but not the most contextually relevant.
This is where **reranking** comes in: a second-stage process that reorders the initially retrieved set to better match the true information need.

Reranking uses a dedicated model, typically a **cross-encoder** or a **late interaction model**, to directly compare each candidate passage with the query and assign a fine-grained relevance score. By re-evaluating these candidate passages, rerankers help bring the most useful, specific, and accurate results to the top.

In this notebook, we’ll explore the most popular reranking approaches in modern RAG pipelines, with an intuitive look at how these models work and how they improve retrieval quality.

---
## Reranking Models

<img src="./media/embedding.png" width=600>

[Choosing the Right Embedding Model for RAG in Generative AI](https://medium.com/bright-ai/choosing-the-right-embedding-for-rag-in-generative-ai-applications-8cf5b36472e1)

In a two-stage RAG pipeline, we rely on a few different pre-trained encoder models to convert our unstructured content (generally text) into dense vector representations that capture the learned semantics of language. The first is the commonly known "embedding model", set up as a bi-encoder: one encoding pass for the query and one for the document(s). As this notebook focuses on reranking, I will only give a brief overview of base embedding models here, but if you'd like to learn the specifics you can check out [my guide on bidirectional encoder representations from transformers](https://www.youtube.com/watch?v=n_UQ0e0fBIA)!

### Context: Bi-Encoder (Embedding Model)

<img src="./media/biencoder.png" width=300>

[Cross-encoders vs Bi-encoders : A deep-dive into text encoding methods](https://medium.com/@rbhatia46/cross-encoders-vs-bi-encoders-a-deep-dive-into-text-encoding-methods-d9aa890d6ca4)

The **bi-encoder architecture** is widely used for vector similarity search in retrieval systems, where the base embedding model independently encodes queries and documents into numerical vectors (“embeddings”) that capture their meaning. At search time, these embeddings are quickly compared (usually via cosine similarity) to find the most relevant matches. This dual-encoder setup is where the "bi" in bi-encoder comes from. It typically builds on models like [BERT](https://arxiv.org/pdf/1810.04805), which leverage the transformer architecture and attention mechanism to learn powerful language representations via objectives like masked language modeling (MLM). In MLM, the model is trained to predict masked-out words in sentences. After training on many millions of examples, it develops a deep understanding of context and semantics, which is what lets it perform this dense vector encoding.

As a quick clarification, the “bi” in bi-encoder refers to using **two separate encoding passes** (one for queries, one for documents), often with shared model weights. This is distinct from BERT’s “bidirectional” attention, which lets each token attend to both its left and right context.

Let's take a quick look at what this looks like with a popular embedding model, [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

#### Load the Model

In [26]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

#### Define the Query & Documents

In [27]:
query = "Who is the best technical AI youtuber?"

documents = [
    "The dogwood is the state flower of North Carolina",
    "Adam Lucek makes videos about artificial intelligence",
    "The canada goose has a lifespan of 10-24 years"]

#### Example Encoding

Here we can see what happens when the text is encoded into a numerical form. With the embedding model we're using, our sentence is mapped into a 384-dimensional dense vector space.

In [29]:
query_embedding = embedding_model.encode(query)

print("First 10 dimensions: ", query_embedding[:10])
print("\n Total Size: ",len(query_embedding), "dimensions.")

First 10 dimensions:  [-0.08816597 -0.12000839 -0.03506935 -0.10491663  0.01010485 -0.007575
  0.08377688  0.0845268   0.01181077 -0.02637845]

 Total Size:  384 dimensions.


#### Embedding Documents Independently

In [30]:
doc_embeddings = embedding_model.encode(documents)

#### Computing Similarity

In [31]:
similarity = embedding_model.similarity(query_embedding, doc_embeddings)

print(similarity)

tensor([[-0.0982,  0.6025, -0.0030]])


We can see that our query: *Who is the best technical AI youtuber?* is most similar to the embedding in position 2 (index 1): *Adam Lucek makes videos about artificial intelligence* 😎

This is the backbone of bi-encoder-based vector similarity retrieval: document embeddings are computed and stored ahead of time, then compared at run time to the embedded query. In practice this is done far more efficiently than in our quick example by using [vector databases](https://github.com/ALucek/embeddings-guide/blob/main/WTF_VDB.ipynb), and it forms the first stage of the two-stage RAG process: initial retrieval. Once our retrieved set is available, we can then employ our reranking models to provide a more refined ranking of relevant context for query responses.
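
To make the "store once, compare at query time" idea concrete, here is a minimal sketch of brute-force top-k retrieval over precomputed vectors. The three-dimensional vectors and the `top_k` helper are made up for illustration (they are not model output or part of any library); a real vector database replaces the linear scan with an approximate index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "document embeddings", computed ahead of time in a real system
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.2, 0.9],
]

# Toy "query embedding", computed at search time
query_vec = [0.2, 0.8, 0.1]

results = top_k(query_vec, doc_vecs, k=2)
for idx, sim in results:
    print(f"doc {idx}: cosine similarity {sim:.3f}")
```

The documents never need to be re-encoded per query, which is what keeps this first stage cheap.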

### Cross Encoder

<img src="./media/cross-encoder.png" width=800>

[The Illustrated Guide to Cross-Encoders: From Deep to Shallow](https://medium.com/@kakumar1611/the-illustrated-guide-to-cross-encoders-from-deep-to-shallow-2a23a8630016)

The **cross-encoder architecture** extends base embedding models by training them as classifiers for direct semantic similarity or relevance. Rather than computing document and query embeddings independently for later comparison, cross-encoders take the query and each document and **concatenate them as a single input** (e.g., `[CLS] query [SEP] document [SEP]`) to the model. The output is a **direct relevance score** for the query-document pair, typically produced from the model’s classification `[CLS]` token (or an output head attached to it).

The key advantage of the cross-encoder is that **every token in the input (query and document) can attend to every other token**, allowing for richer, fine-grained interactions between query and document. This usually yields higher accuracy and more nuanced matching than bi-encoders. However, it comes at a computational cost: each query-document pair must be processed together in its own forward pass, which is much slower than similarity search over precomputed embeddings. This is why a cross-encoder is usually applied as the second step in retrieval, after a smaller candidate set has been retrieved.
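
The division of labor described above can be sketched without any models at all. In the toy example below, `first_stage_score` and `second_stage_score` are hypothetical stand-ins (word overlap and adjacent-word matching) for an embedding similarity and a cross-encoder respectively; only the control flow is the point: score everything cheaply, then run the expensive scorer on a short candidate list.

```python
def first_stage_score(query, doc):
    """Cheap stand-in for embedding similarity: share of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def second_stage_score(query, doc):
    """Stand-in for a cross-encoder: reads query and doc together and counts
    adjacent query word pairs that also appear adjacent in the doc."""
    q = query.lower().split()
    d = doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams)

def retrieve_then_rerank(query, docs, n_candidates=3):
    # Stage 1: score every document cheaply and keep the best n_candidates
    ranked = sorted(range(len(docs)),
                    key=lambda i: first_stage_score(query, docs[i]),
                    reverse=True)
    candidates = ranked[:n_candidates]
    # Stage 2: run the expensive scorer only on those candidates
    reranked = sorted(candidates,
                      key=lambda i: second_stage_score(query, docs[i]),
                      reverse=True)
    return candidates, reranked

docs = [
    "meditation and exercise have benefits of many kinds",
    "meditation apps are popular",
    "the benefits of meditation include lower stress",
    "kyoto is famous for temples",
    "cooking at home is fun",
]
query = "benefits of meditation"

candidates, reranked = retrieve_then_rerank(query, docs)
print("Stage 1 candidates:", candidates)
print("After reranking:   ", reranked)
```

With five documents and `n_candidates=3`, the expensive scorer runs three times instead of five; at realistic corpus sizes that gap is what makes the second stage affordable.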

#### Load the Model

For this example we'll be using the popular cross-encoder [ms-marco-MiniLM-L6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2).

In [32]:
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

#### Define the Query & Documents

In [33]:
query = "What are the health benefits of meditation?"

documents = [
    "Several clinical studies have shown that regular meditation can help reduce stress and anxiety levels.",
    "Meditation involves focusing the mind and eliminating distractions, often through breathing techniques or guided imagery.",
    "A daily meditation practice has been associated with lower blood pressure and improved sleep quality in adults.",
    "The city of Kyoto is famous for its Zen temples, where meditation has been practiced for centuries.",
    "People who meditate frequently often report feeling calmer and more focused throughout the day.",
    "Research suggests meditation may lower the risk of heart disease by reducing inflammation and improving heart rate variability.",
    "Meditation apps have become increasingly popular, offering guided sessions on mindfulness and relaxation.",
    "A 2021 meta-analysis found that meditation can reduce symptoms of depression when used alongside other treatments.",
    "Some forms of meditation emphasize compassion and kindness, aiming to improve emotional well-being.",
    "Athletes sometimes use meditation techniques to enhance concentration and mental resilience during competition.",
]

#### Rank Documents

In [34]:
ranks = cross_encoder.rank(query, documents)

print("="*25, "Cross Encoder Rankings", "="*25, "\n")
for rank in ranks:
    print(f"{rank['score']:.2f}\t{documents[rank['corpus_id']]}")


7.99	Research suggests meditation may lower the risk of heart disease by reducing inflammation and improving heart rate variability.
6.95	Several clinical studies have shown that regular meditation can help reduce stress and anxiety levels.
6.31	A 2021 meta-analysis found that meditation can reduce symptoms of depression when used alongside other treatments.
5.87	A daily meditation practice has been associated with lower blood pressure and improved sleep quality in adults.
2.54	Some forms of meditation emphasize compassion and kindness, aiming to improve emotional well-being.
1.02	Meditation involves focusing the mind and eliminating distractions, often through breathing techniques or guided imagery.
0.47	Athletes sometimes use meditation techniques to enhance concentration and mental resilience during competition.
-2.84	People who meditate frequently often report feeling calmer and more focused throughout the day.
-2.97	Meditation apps have become increasingly popular, offering guide

As you can see, we're able to order the passages much closer to what's actually relevant to our query! Whereas regular bi-encoder retrieval may focus too much on keywords and themes, the cross-encoder clearly picks out the passages that state specific health benefits.
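
The scores printed above are unbounded and some are negative, which is what raw logits look like. Assuming that is what they are (the notebook does not say so explicitly), a sigmoid squashes them into the range (0, 1), which can be easier to read or threshold. The ordering does not change, since the sigmoid is monotonic. The sample values below are copied from the output above.

```python
import math

def sigmoid(x):
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# A few of the cross-encoder scores printed above
raw_scores = [7.99, 6.95, 0.47, -2.84]
squashed = [sigmoid(s) for s in raw_scores]

for raw, p in zip(raw_scores, squashed):
    print(f"{raw:6.2f} -> {p:.4f}")
```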

### Late Interaction Model

<img src="./media/multi.png" width=800>

[ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/pdf/2004.12832)

The late interaction architecture was introduced in the ColBERT paper, after cross-encoders had become popular, and takes inspiration from both the bi-encoder and cross-encoder setups. Late interaction reranking tokenizes the query and each retrieved document, then encodes each one as a matrix of token-level vectors. This contrasts with a base embedding model, which pools all of its token-level embeddings into a single vector. For each query token, the similarity to every document token vector is computed and only the maximum value is kept. These maximum similarities are then aggregated across all query tokens to produce the final relevance score for each document.

<img src="./media/late-interaction.excalidraw.png" width=800>

Mathematically, given:

- Query $Q$ with tokens $q_1, q_2, \ldots, q_m$
- Candidate document $D$ with tokens $d_1, d_2, \ldots, d_n$

Each token is encoded into a vector: $Q = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_m]$, $D = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_n]$.

For each query token $\mathbf{q}_j$: $s_j = \max_k \left( \cos (\mathbf{q}_j, \mathbf{d}_k) \right)$

The final document relevance score $S$ is aggregated by sum or mean:
$S = \sum_{j=1}^m s_j$, or $S = \frac{1}{m} \sum_{j=1}^m s_j$
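
As a worked instance of the formulas above, the sketch below computes $s_j$ and the summed score $S$ for a two-token query against two toy documents. The two-dimensional vectors are invented so the arithmetic can be checked by hand, and `maxsim_score` is just an illustrative name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def maxsim_score(query_vecs, doc_vecs):
    """Return (S, [s_1, ..., s_m]) where s_j is the best match for query token j."""
    per_token_max = [max(cosine(q, d) for d in doc_vecs) for q in query_vecs]
    return sum(per_token_max), per_token_max

# Two query tokens pointing along the two axes
query_vecs = [[1.0, 0.0], [0.0, 1.0]]

# doc_a contains an exact match for both query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# doc_b matches the first token exactly, the second only partially
doc_b = [[1.0, 0.0], [1.0, 1.0]]

score_a, parts_a = maxsim_score(query_vecs, doc_a)
score_b, parts_b = maxsim_score(query_vecs, doc_b)

print("doc_a:", parts_a, "->", round(score_a, 4))
print("doc_b:", parts_b, "->", round(score_b, 4))
```

For `doc_b`, the second query token's best match is the diagonal vector, with cosine $1/\sqrt{2}$, so the document scores $1 + 1/\sqrt{2}$ against a maximum possible score of $m = 2$.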

#### Ranking with ColBERT

We'll be using [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0) for our demonstration. Alongside the model there is an [official repo](https://github.com/stanford-futuredata/ColBERT/tree/main) that includes abstractions and functions for processing documents and ranking with ColBERT's late interaction approach. For demonstration's sake, though, we'll implement late interaction directly with `transformers` and `torch`.

#### Load the Model

We'll use the [transformers](https://huggingface.co/docs/transformers/en/index) package directly to load the tokenizer and model from the 🤗 Hub.

In [35]:
from transformers import AutoTokenizer, AutoModel

colbert_tokenizer = AutoTokenizer.from_pretrained("colbert-ir/colbertv2.0")
colbert = AutoModel.from_pretrained("colbert-ir/colbertv2.0")

#### Define the Query & Documents

Same as before!

In [36]:
query = "What are the health benefits of meditation?"

documents = [
    "Several clinical studies have shown that regular meditation can help reduce stress and anxiety levels.",
    "Meditation involves focusing the mind and eliminating distractions, often through breathing techniques or guided imagery.",
    "A daily meditation practice has been associated with lower blood pressure and improved sleep quality in adults.",
    "The city of Kyoto is famous for its Zen temples, where meditation has been practiced for centuries.",
    "People who meditate frequently often report feeling calmer and more focused throughout the day.",
    "Research suggests meditation may lower the risk of heart disease by reducing inflammation and improving heart rate variability.",
    "Meditation apps have become increasingly popular, offering guided sessions on mindfulness and relaxation.",
    "A 2021 meta-analysis found that meditation can reduce symptoms of depression when used alongside other treatments.",
    "Some forms of meditation emphasize compassion and kindness, aiming to improve emotional well-being.",
    "Athletes sometimes use meditation techniques to enhance concentration and mental resilience during competition.",
]

#### Helper Functions

We'll be defining two helper functions, `get_token_embeddings` and `colbert_score`.

`get_token_embeddings` will compute the token-level embeddings through the model, taking in text and first running it through the tokenizer to split into individual tokens. Those tokens are then sent through the model to create a vector representation for each. We remove any special classifier or separator tokens that may have been output and keep just the individual embeddings.

`colbert_score` then takes in the query embeddings and document embeddings, which have the shape of matrices:

* The **query embeddings** have shape $(m, d)$, where $m$ is the number of query tokens and $d$ is the embedding dimension.
* The **document embeddings** have shape $(n, d)$, where $n$ is the number of document tokens.

The scoring function computes the **cosine similarity** between each query token embedding and every document token embedding, resulting in an $(m, n)$ similarity matrix. For each query token, we take the **maximum similarity** value across all document tokens. These maximum similarities are then **summed** across all query tokens to produce the final relevance score for the document.

In [37]:
import torch
import torch.nn.functional as F

def get_token_embeddings(text, tokenizer, model):
    # Get token-level embeddings, ignore [CLS] and [SEP]
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    # outputs.last_hidden_state: (1, seq_len, hidden_dim)
    input_ids = inputs['input_ids'][0]
    keep_indices = (input_ids != tokenizer.cls_token_id) & (input_ids != tokenizer.sep_token_id)
    
    return outputs.last_hidden_state[0][keep_indices]  # (filtered_seq_len, hidden_dim)

def colbert_score(query_emb, doc_emb):
    
    # query_emb: (m, d), doc_emb: (n, d)
    sim = F.cosine_similarity(query_emb.unsqueeze(1), doc_emb.unsqueeze(0), dim=2)  # (m, n)
    
    max_sim, _ = sim.max(dim=1)  # (m,)
    
    return max_sim.sum().item()

#### Rank Documents

Now we put it all together by embedding the query and each document, computing the scores, then sorting the documents by relevance score.

In [38]:
# Compute token level embeddings
query_emb = get_token_embeddings(query, colbert_tokenizer, colbert)
doc_embs = [get_token_embeddings(doc, colbert_tokenizer, colbert) for doc in documents]

# Compute scores
scores = [colbert_score(query_emb, doc_emb) for doc_emb in doc_embs]

ranking = sorted(zip(scores, documents), reverse=True)

print("="*25, "Late Interaction Rankings", "="*25, "\n")
for score, doc in ranking:
    print(f"{score:.2f}\t{doc}")


6.02	Research suggests meditation may lower the risk of heart disease by reducing inflammation and improving heart rate variability.
5.93	A daily meditation practice has been associated with lower blood pressure and improved sleep quality in adults.
5.88	Several clinical studies have shown that regular meditation can help reduce stress and anxiety levels.
5.71	A 2021 meta-analysis found that meditation can reduce symptoms of depression when used alongside other treatments.
5.63	Some forms of meditation emphasize compassion and kindness, aiming to improve emotional well-being.
5.12	Meditation apps have become increasingly popular, offering guided sessions on mindfulness and relaxation.
5.10	Athletes sometimes use meditation techniques to enhance concentration and mental resilience during competition.
4.91	Meditation involves focusing the mind and eliminating distractions, often through breathing techniques or guided imagery.
4.41	People who meditate frequently often report feeling calm

<img src="./media/colbert_evals.png" width=800>

The ColBERT authors found that this late interaction approach performed comparably to existing cross-encoders and other reranking methods while being faster and requiring much less computation!

## Putting it Together

Now that we have an understanding of the models and approaches for reranking, let's create a simple RAG pipeline that retrieves, reranks, and then generates a response from the top results.

### Vector Database Setup

#### Text Chunking

For our database we'll do a simplified chunking setup on [The Adventures of Sherlock Holmes](https://www.gutenberg.org/ebooks/1661) as available through Project Gutenberg. We'll grab the text from the website and set up a very simple sentence-based chunker that first splits the text into sentences, then greedily combines them into chunks of roughly 400 tokens. These chunks will be embedded into our vector database as the candidates for retrieval!

In [39]:
import requests
import nltk
import tiktoken

# Download Sherlock Holmes text
url = "https://www.gutenberg.org/files/1661/1661-0.txt"
response = requests.get(url)
text = response.text

# Sentence split
# Download punkt_tab sentence tokenizer from the natural language toolkit
nltk.download('punkt_tab', quiet=True)
sentences = nltk.sent_tokenize(text)

# Setup GPT-4 tokenizer @ 400 tokens target
enc = tiktoken.get_encoding("cl100k_base")
token_limit = 400

# Initiate chunking index and list
chunks = []
current_chunk = ""
current_tokens = 0

# Chunk
for sentence in sentences:
    sentence_tokens = len(enc.encode(sentence))
    # If adding this sentence would go over the limit, start a new chunk
    if current_tokens + sentence_tokens > token_limit:
        if current_chunk:
            chunks.append(current_chunk)
        current_chunk = sentence
        current_tokens = sentence_tokens
    else:
        if current_chunk:
            current_chunk += " " + sentence
        else:
            current_chunk = sentence
        current_tokens += sentence_tokens

# Don't forget the last chunk!
if current_chunk:
    chunks.append(current_chunk)

print(f"Number of chunks: {len(chunks)}")
print("\nSample chunk:\n", chunks[10])

Number of chunks: 380

Sample chunk:
 “Your Majesty had not spoken before I
was aware that I was addressing Wilhelm Gottsreich Sigismond von
Ormstein, Grand Duke of Cassel-Felstein, and hereditary King of
Bohemia.”

“But you can understand,” said our strange visitor, sitting down once
more and passing his hand over his high white forehead, “you can
understand that I am not accustomed to doing such business in my own
person. Yet the matter was so delicate that I could not confide it to
an agent without putting myself in his power. I have come _incognito_
from Prague for the purpose of consulting you.”

“Then, pray consult,” said Holmes, shutting his eyes once more. “The facts are briefly these: Some five years ago, during a lengthy
visit to Warsaw, I made the acquaintance of the well-known adventuress,
Irene Adler. The name is no doubt familiar to you.”

“Kindly look her up in my index, Doctor,” murmured Holmes without
opening his eyes. For many years he had adopted a system of docketin

#### VDB Initialization

We'll use [ChromaDB](https://www.trychroma.com/) as our lightweight database of choice. ChromaDB by default relies on the embedding model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for embedding documents and queries.

In [None]:
import chromadb

# Instantiate the Chroma Client
chroma_client = chromadb.Client()

# Create a Collection
collection = chroma_client.get_or_create_collection(name="sherlock_holmes")

# Embed Chunks to the Collection
collection.add(
    documents=chunks,
    ids=[str(i) for i in range(len(chunks))]
)

#### 1st Stage Retrieval

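We'll create a simple function to handle the initial retrieval from the collection, which is performed before reranking. Note that the `score` it returns is the distance reported by Chroma, so lower values mean closer matches.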

In [40]:
def retrieve_docs(query, collection="sherlock_holmes", n=25):
    # Load Chroma Collection
    collection = chroma_client.get_or_create_collection(name=collection)

    # Perform semantic search
    results = collection.query(
        query_texts=[query],
        n_results=n
    )

    # Unpack documents and distances for our single query
    docs = results["documents"][0]
    scores = results["distances"][0]

    # Combine into list of dicts
    return [{"document": doc, "score": score} for doc, score in zip(docs, scores)]
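
Because the first-stage `score` is a distance, it helps to see how it relates to the cosine similarities used earlier. The sketch below rests on two assumptions that are not stated in this notebook: that the collection uses squared L2 distance, and that the embeddings are unit-normalized. Under those assumptions, $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$, so a distance $d$ corresponds to a cosine similarity of $1 - d/2$. The vectors are toy values, not real embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def distance_to_cosine(d):
    """Valid only for squared L2 distance between unit-length vectors (assumption)."""
    return 1.0 - d / 2.0

a = normalize([0.2, 0.8, 0.1])
b = normalize([0.1, 0.9, 0.2])

d = squared_l2(a, b)
print(f"squared L2 distance: {d:.4f}")
print(f"cosine from distance: {distance_to_cosine(d):.4f}")
print(f"cosine computed directly: {cosine(a, b):.4f}")
```

If the collection were configured with a different distance metric, this conversion would not apply, but the "lower is closer" reading of the distances would still hold.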

#### 2nd Stage Reranking - Cross Encoder

The first of our two reranking functions runs the retrieved documents through a cross-encoder, using the same model as in our prior example.

In [47]:
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_with_cross_encoder(query, results, cross_encoder_model=cross_encoder):
    # Grab chunks from dictionary
    documents = [r['document'] for r in results]
    
    # Compute cross encoder relevancy score
    rerank_scores = cross_encoder_model.predict([(query, doc) for doc in documents])
    
    for r, score in zip(results, rerank_scores):
        r['cross_encoder_score'] = float(score)
    
    # Sort results by cross_encoder_score, descending
    results = sorted(results, key=lambda x: x['cross_encoder_score'], reverse=True)
    
    return results

#### 2nd Stage Reranking - Late Interaction

The second of our reranking functions uses late interaction with ColBERT once more! We'll redefine our two helper functions here in case this section is being run in isolation from the walkthrough code above.

In [42]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load ColBERTv2 model & tokenizer just once (at top level)
colbert_tokenizer = AutoTokenizer.from_pretrained("colbert-ir/colbertv2.0")
colbert_model = AutoModel.from_pretrained("colbert-ir/colbertv2.0")

def get_token_embeddings(text, tokenizer, model):
    # Get token-level embeddings, ignore [CLS] and [SEP]
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    input_ids = inputs['input_ids'][0]
    keep_indices = (input_ids != tokenizer.cls_token_id) & (input_ids != tokenizer.sep_token_id)
    return outputs.last_hidden_state[0][keep_indices]  # (filtered_seq_len, hidden_dim)

def colbert_score(query_emb, doc_emb):
    # query_emb: (m, d), doc_emb: (n, d)
    sim = F.cosine_similarity(query_emb.unsqueeze(1), doc_emb.unsqueeze(0), dim=2)  # (m, n)
    max_sim, _ = sim.max(dim=1)  # (m,)
    return max_sim.sum().item()

def rerank_with_late_interaction(query, results, tokenizer=colbert_tokenizer, model=colbert_model):
    # Precompute query embeddings once
    query_emb = get_token_embeddings(query, tokenizer, model)
    for r in results:
        doc_emb = get_token_embeddings(r['document'], tokenizer, model)
        r['late_interaction_score'] = colbert_score(query_emb, doc_emb)
    # Sort results by late_interaction_score descending
    results = sorted(results, key=lambda x: x['late_interaction_score'], reverse=True)
    return results

### Testing it Out!

Let's now run the entire pipeline and compare our outputs.
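
When comparing rankings by eye, it can help to put a number on how much a reranker changed things. The helpers below are a small sketch for that: `overlap_at_k` and `rank_shift` are hypothetical names, and the chunk ids are placeholders. With the pipeline in this notebook you would pass lists of chunk texts (or ids) taken from each stage's results.

```python
def overlap_at_k(ranking_a, ranking_b, k=5):
    """Fraction of the top-k items that two rankings have in common."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k

def rank_shift(before, after):
    """Map each item to (old position - new position); positive means it moved up."""
    old_pos = {item: i for i, item in enumerate(before)}
    return {item: old_pos[item] - i for i, item in enumerate(after) if item in old_pos}

# Placeholder ids standing in for retrieved chunks
first_stage = ["c1", "c2", "c3", "c4", "c5", "c6"]
reranked = ["c4", "c1", "c6", "c2", "c3", "c5"]

overlap = overlap_at_k(first_stage, reranked, k=3)
shifts = rank_shift(first_stage, reranked)

print(f"overlap@3: {overlap:.2f}")
print("rank shifts:", shifts)
```

A low overlap at small `k` means the reranker is pulling in passages the first stage had buried, which is the behavior the outputs below show.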

**Stage 1: Retrieval only**

In [48]:
query = "Show moments where Holmes or Watson reflect on friendship."

# Stage 1: Retrieve top 50
retrieved_docs = retrieve_docs(query, collection="sherlock_holmes", n=50)

print("="*20, "Top 5 Retrieved (Semantic Search Only)", "="*20)
for i, r in enumerate(retrieved_docs[:5]):
    print(f"[{i+1}] Score: {r['score']:.4f}\n{r['document'][:200]}...\n")

[1] Score: 0.9414
“If the police are to
hush this thing up, there must be no more of Hugh Boone.”

“I have sworn it by the most solemn oaths which a man can take.”

“In that case I think that it is probable that n...

[2] Score: 0.9463
I simply wish to hear your real, real opinion.”

“Upon what point?”

“In your heart of hearts, do you think that Neville is alive?”

Sherlock Holmes seemed to be embarrassed by the question. “Fr...

[3] Score: 0.9467
It may be so in this
case, also.”

“Well, let us hope so. But our doubts will very soon be solved, for
here, unless I am much mistaken, is the person in question.”

As he spoke the door opened a...

[4] Score: 0.9587
He
said that if they were sent to the office he would be chaffed by all
the other clerks about having letters from a lady, so I offered to
typewrite them, like he did his, but he wouldn’t have that...

[5] Score: 0.9642
He wore rather baggy grey shepherd’s check trousers,
a not over-clean black frock-coat, unbuttoned in the fron

**Stage 2: Cross-Encoder reranking**

In [51]:
cross_encoder_reranked = rerank_with_cross_encoder(query, retrieved_docs)

print("="*20, "Top 5 After Cross Encoder Rerank", "="*20)
for i, r in enumerate(cross_encoder_reranked[:5]):
    print(f"[{i+1}] Cross-Encoder Score: {r['cross_encoder_score']:.2f}\n{r['document'][:200]}...\n")

[1] Cross-Encoder Score: -1.56
“Very sorry to knock you up, Watson,” said he, “but it’s the common lot
this morning. Mrs. Hudson has been knocked up, she retorted upon me,
and I on you.”

“What is it, then—a fire?”

“No; a cl...

[2] Cross-Encoder Score: -1.81
Holmes unlocked his strong-box and held up the blue
carbuncle, which shone out like a star, with a cold, brilliant,
many-pointed radiance. Ryder stood glaring with a drawn face, uncertain
whether t...

[3] Cross-Encoder Score: -2.61
A few moments later he was in our room, still puffing, still
gesticulating, but with so fixed a look of grief and despair in his
eyes that our smiles were turned in an instant to horror and pity. Fo...

[4] Cross-Encoder Score: -2.77
Holmes cut the cord and removed the transverse bar. Then he
tried the various keys in the lock, but without success. No sound came
from within, and at the silence Holmes’ face clouded over. “I trust...

[5] Cross-Encoder Score: -2.84
Sherlock Holmes sat moodily at one sid

**Stage 2 (alternative): Late Interaction reranking**

In [52]:
late_interaction_reranked = rerank_with_late_interaction(query, retrieved_docs)

print("="*20, "Top 5 After Late Interaction Rerank", "="*20)
for i, r in enumerate(late_interaction_reranked[:5]):
    print(f"[{i+1}] Late Interaction Score: {r['late_interaction_score']:.2f}\n{r['document'][:200]}...\n")

[1] Late Interaction Score: 6.50
This gentleman, Mr. Wilson, has been my partner and helper
in many of my most successful cases, and I have no doubt that he will
be of the utmost use to me in yours also.”

The stout gentleman hal...

[2] Late Interaction Score: 6.44
Holmes unlocked his strong-box and held up the blue
carbuncle, which shone out like a star, with a cold, brilliant,
many-pointed radiance. Ryder stood glaring with a drawn face, uncertain
whether t...

[3] Late Interaction Score: 6.38
THE ADVENTURE OF THE ENGINEER’S THUMB


Of all the problems which have been submitted to my friend, Mr.
Sherlock Holmes, for solution during the years of our intimacy, there
were only two which I...

[4] Late Interaction Score: 6.30
Holmes cut the cord and removed the transverse bar. Then he
tried the various keys in the lock, but without success. No sound came
from within, and at the silence Holmes’ face clouded over. “I trust...

[5] Late Interaction Score: 6.27
A few moments later he was in

### Combining with an LLM for RAG

Now that we have our retrieval and reranking systems in place, we can combine everything into a full-fledged RAG system.

In [53]:
from openai import OpenAI

def rag_response(query, ranking="none", k=5):

    # Instantiate OpenAI Client
    client = OpenAI()

    # Retrieve our initial set of 50 documents
    retrieved_documents = retrieve_docs(query, n=50)  # returns list of dicts with 'document' (and scores)

    # Rerank and sort based on our three methods
    if ranking == "none":
        sorted_docs = sorted(retrieved_documents, key=lambda r: r['score'])
        docs = [r['document'] for r in sorted_docs[:k]]
    elif ranking == "cross_encoder":
        reranked = rerank_with_cross_encoder(query, retrieved_documents)
        docs = [r['document'] for r in reranked[:k]]
    elif ranking == "late_interaction":
        reranked = rerank_with_late_interaction(query, retrieved_documents)
        docs = [r['document'] for r in reranked[:k]]
    else:
        raise ValueError("Argument 'ranking' must be one of ['none', 'cross_encoder', 'late_interaction']") 

    # DEBUG PRINT: show which chunks are being provided
    print(f"\nProvided Chunks (top {k}, ranking: {ranking}):\n" + "-"*60)
    for i, doc in enumerate(docs, 1):
        print(f"[{i}] {doc[:250]}{'...' if len(doc)>250 else ''}\n")  # Show first 250 chars for readability
    print("="*50)

    # Combine top documents for passage to LLM
    context = "\n\n".join(docs)
    prompt = f"""You are a Sherlock Holmes expert. Use ONLY the following passages from the stories to answer the question.

Passages:
{context}

Question: {query}

If you cannot find an answer in the passages, reply: "The answer is not shown in the provided context." Otherwise, answer as specifically as possible using the text above.
"""

    # Generate response from the model
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5
    )
    
    return response.choices[0].message.content.strip()

**Simple Semantic Search**

In [54]:
answer = rag_response("Show moments where Holmes or Watson reflect on friendship.", ranking="none")
print(answer)


Provided Chunks (top 5, ranking: none):
------------------------------------------------------------
[1] “If the police are to
hush this thing up, there must be no more of Hugh Boone.”

“I have sworn it by the most solemn oaths which a man can take.”

“In that case I think that it is probable that no further steps may be
taken. But if you are foun...

[2] I simply wish to hear your real, real opinion.”

“Upon what point?”

“In your heart of hearts, do you think that Neville is alive?”

Sherlock Holmes seemed to be embarrassed by the question. “Frankly,
now!” she repeated, standing upon the rug ...

[3] It may be so in this
case, also.”

“Well, let us hope so. But our doubts will very soon be solved, for
here, unless I am much mistaken, is the person in question.”

As he spoke the door opened and a young lady entered the room. She was
plainly...

[4] He
said that if they were sent to the office he would be chaffed by all
the other clerks about having letters from a lady, so I offered t

**Reranking with Cross Encoder**

In [55]:
answer = rag_response("Show moments where Holmes or Watson reflect on friendship.", ranking="cross_encoder")
print(answer)


Provided Chunks (top 5, ranking: cross_encoder):
------------------------------------------------------------
[1] “Very sorry to knock you up, Watson,” said he, “but it’s the common lot
this morning. Mrs. Hudson has been knocked up, she retorted upon me,
and I on you.”

“What is it, then—a fire?”

“No; a client. It seems that a young lady has arrived in a ...

[2] Holmes unlocked his strong-box and held up the blue
carbuncle, which shone out like a star, with a cold, brilliant,
many-pointed radiance. Ryder stood glaring with a drawn face, uncertain
whether to claim or to disown it. “The game’s up, Ryder,” s...

[3] A few moments later he was in our room, still puffing, still
gesticulating, but with so fixed a look of grief and despair in his
eyes that our smiles were turned in an instant to horror and pity. For
a while he could not get his words out, but swa...

[4] Holmes cut the cord and removed the transverse bar. Then he
tried the various keys in the lock, but without success. No 

**Reranking with Late Interaction**

In [56]:
answer = rag_response("Show moments where Holmes or Watson reflect on friendship.", ranking="late_interaction")
print(answer)


Provided Chunks (top 5, ranking: late_interaction):
------------------------------------------------------------
[1] This gentleman, Mr. Wilson, has been my partner and helper
in many of my most successful cases, and I have no doubt that he will
be of the utmost use to me in yours also.”

The stout gentleman half rose from his chair and gave a bob of
greeting,...

[2] Holmes unlocked his strong-box and held up the blue
carbuncle, which shone out like a star, with a cold, brilliant,
many-pointed radiance. Ryder stood glaring with a drawn face, uncertain
whether to claim or to disown it. “The game’s up, Ryder,” s...

[3] THE ADVENTURE OF THE ENGINEER’S THUMB


Of all the problems which have been submitted to my friend, Mr.
Sherlock Holmes, for solution during the years of our intimacy, there
were only two which I was the means of introducing to his notice—that
...

[4] Holmes cut the cord and removed the transverse bar. Then he
tried the various keys in the lock, but without success. No

---
## Discussion

In the notebook above, we demonstrated a two-step retrieve-and-rerank process: a traditional bi-encoder for first-stage retrieval, followed by cross-encoder and late-interaction models for reranking. How each approach encodes, scores, and what it costs is summarized in the table below:

**Comparison Table**

| Step         | Retrieval (Embedding)          | Rerank: Late Interaction (ColBERT)   | Rerank: Cross-Encoder           |
| ------------ | ------------------------------ | ------------------------------------ | ------------------------------- |
| Query Encode | 1 vector per query             | Matrix of vectors per query          | Joint encoding (query+doc pair) |
| Doc Encode   | 1 vector per doc (pre-compute) | Matrix of vectors per doc (pre-comp) | N/A (encode per query+doc pair) |
| Scoring      | Cosine/dot similarity          | MaxSim per query token + aggregate   | Full transformer, \[CLS] output |
| Compute cost | Very low                       | Moderate (matrix op per pair)        | High (full forward per pair)    |

In essence, the bi-encoder is a recall-oriented coarse retrieval tool, while the cross-encoder and late-interaction rerankers are precision-oriented fine reranking steps. The traditional bi-encoder quickly narrows the corpus to a candidate subset, which a more compute-intensive reranking model can then refine to improve the retrieval results and, in turn, the LLM's context-grounded responses.
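
To make the "Scoring" row of the table and the coarse-then-fine flow concrete, here is a minimal NumPy sketch. It is not the code used in this notebook: the function names (`cosine_scores`, `maxsim_score`, `retrieve_then_rerank`) and the tiny hand-written vectors are illustrative assumptions, and a real late-interaction model such as ColBERT would supply learned token embeddings instead of these toy matrices.

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Bi-encoder style: one vector per query, one (pre-computed) vector per doc."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q  # one cosine similarity per document

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction style: MaxSim per query token, then sum over query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

def retrieve_then_rerank(query_vec, query_tokens, doc_vecs, doc_token_mats,
                         n_candidates, k):
    """Stage 1: cheap cosine retrieval. Stage 2: MaxSim rerank of the candidates only."""
    coarse = cosine_scores(query_vec, doc_vecs)
    candidates = np.argsort(-coarse)[:n_candidates]
    reranked = sorted(
        candidates,
        key=lambda i: maxsim_score(query_tokens, doc_token_mats[i]),
        reverse=True,
    )
    return [int(i) for i in reranked[:k]]

# Toy data (hypothetical): 3 documents in a 2-dimensional embedding space.
query_vec = np.array([1.0, 0.0])
query_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])  # two query "tokens"
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
doc_token_mats = [
    np.array([[1.0, 0.0]]),              # doc 0 matches only the first query token
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # doc 1 matches both query tokens
    np.array([[0.0, 1.0]]),              # doc 2 is dropped in stage 1
]

top = retrieve_then_rerank(query_vec, query_tokens, doc_vecs, doc_token_mats,
                           n_candidates=2, k=2)
print(top)  # -> [1, 0]
```

In this toy example the single-vector cosine score ranks doc 0 first, but the token-level MaxSim score promotes doc 1 because it covers both query tokens. That is the kind of reordering a reranker is meant to provide, and it only has to score the small candidate set rather than the whole corpus.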