# Post-Retrieval Strategies

## Reranking Retrieved Chunks

### What is a Reranker?

A reranker is a machine learning model used in search systems to reorder a set of retrieved documents by relevance to a user query. Imagine you search for something, and the system pulls back a batch of documents. Not all of them are equally useful. The reranker steps in after the initial search to figure out which of those documents are most relevant to what you’re asking.

At its core, a reranker is typically built using a "cross-encoder" model. Unlike embedding-based search, which compresses the query and each document into vectors separately, a cross-encoder reads the query and a candidate document together and scores that pair directly. This allows it to provide a more accurate score for how closely a document matches a query, improving the relevance of the results presented to the user.

### Why a Reranker is Needed

#### Inadequacy of Vector/Keyword Search

Let’s start with the basic problem: vector search, and even traditional keyword search, has limitations. In a typical retrieval-augmented generation (RAG) setup, you’re dealing with a lot of documents, sometimes tens of thousands, other times millions. The first step in a RAG pipeline is usually vector search. Here, documents are turned into numerical representations, or "vectors," and stored in a vector database. When a user submits a query, it’s also turned into a vector, and the system retrieves the documents whose vectors are closest to the query vector under a similarity measure such as cosine similarity.
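The retrieval step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a vector database is implemented: the three-dimensional vectors, the `toy_index` dictionary, and the `vector_search` helper are all made up for the example, and a real system would use an embedding model and an approximate nearest-neighbour index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vector, index, k=2):
    """Return the k (doc_id, similarity) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
toy_index = {
    "algorithms": [0.9, 0.1, 0.0],
    "data-structures": [0.8, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}
toy_query_vector = [1.0, 0.2, 0.0]

top_hits = vector_search(toy_query_vector, toy_index, k=2)
print(top_hits)
```

Note that the document vectors are fixed before any query arrives; only the query vector is computed at search time.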

While this sounds straightforward, there’s a catch. Vector search involves compressing the "meaning" of a document into a fixed-length vector, typically 768 or 1536 dimensions. This compression inevitably leads to information loss. When we’re crunching documents into smaller vectors, there’s no guarantee that every subtle detail of the document's meaning will be preserved. As a result, highly relevant information might be hidden in documents that don’t make it to the top results of the vector search. You might end up retrieving documents that are good but not great, missing key information that could answer the user’s query better.
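As a deliberately crude illustration of this information loss, the sketch below uses a toy bag-of-words "embedding" (`toy_embed` is a hypothetical helper, not a real model). Real embedding models are far more expressive and do capture word order, but the underlying point is the same: a fixed-length vector cannot keep everything.

```python
def toy_embed(text, dims=8):
    """Toy embedding: count words into a fixed number of buckets (not a real model)."""
    vector = [0] * dims
    for word in text.lower().split():
        bucket = sum(ord(ch) for ch in word) % dims  # deterministic bucket per word
        vector[bucket] += 1
    return vector

doc_a = "the dog bites the man"
doc_b = "the man bites the dog"

vec_a = toy_embed(doc_a)
vec_b = toy_embed(doc_b)

# Two sentences with different meanings collapse to the same 8-number vector,
# because word order is discarded during compression.
print(vec_a == vec_b)  # prints True
```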

This problem becomes even more apparent with large datasets. Vector search is good for finding “close enough” documents fast, but it’s often too blunt for identifying nuanced, highly relevant documents. It also can’t take the query into account when representing a document: the embeddings are computed before the user query arrives, so the system has no chance to tailor them to the specific question being asked.

#### How a Reranker Overcomes This

This is where rerankers shine. A reranker takes the top documents retrieved by the vector search and refines their order based on a deeper understanding of both the query and each document. Instead of treating the query and document separately, as vector searches do, a reranker looks at them together. It applies a large transformer model (like BERT) to both the query and the retrieved document, allowing it to understand the relationship between the two in much greater detail.

Here’s how it works: after the vector search pulls a set of documents, the reranker pairs the query with each document, feeds each pair into the transformer model, and produces a relevance score. This score reflects how well the document answers the specific query. In short, the reranker makes its decision from the actual text of both the query and the document rather than from their precomputed vector representations.
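The pairing-and-scoring loop can be sketched as follows. The `rerank` function and `toy_score_pair` are hypothetical names introduced for illustration; in particular, `toy_score_pair` is a simple word-overlap stand-in for the transformer, used only so the example runs without a model.

```python
from typing import Callable, List, Tuple

def rerank(query: str, documents: List[str],
           score_pair: Callable[[str, str], float], top_n: int = 3) -> List[Tuple[int, float]]:
    """Score every (query, document) pair and return (original_index, score), best first."""
    scored = [(index, score_pair(query, doc)) for index, doc in enumerate(documents)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

def toy_score_pair(query: str, document: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)

docs = [
    "vector databases store embeddings",
    "rerankers improve rag pipelines by reordering retrieved chunks",
    "rag pipelines combine retrieval and generation",
]
ranking = rerank("how do rerankers improve rag pipelines", docs, toy_score_pair, top_n=2)
print(ranking)
```

The scorer sees the query and the document at the same time, which is the property that distinguishes this stage from comparing two independently computed vectors.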

For example, if the query is “How do rerankers improve RAG pipelines?” a reranker would look at every document retrieved and determine which ones specifically talk about how rerankers improve RAG pipelines—not just documents that vaguely match the topic. This precision comes at a cost: rerankers are slower because they perform a full transformer computation for each query-document pair. But the accuracy boost makes it worth it in many cases.

### Tradeoffs of Using a Reranker

While rerankers improve the accuracy of search results, they come with tradeoffs, primarily in terms of speed and computational cost. Rerankers, especially those based on large transformer models, require significant processing power because they perform a full transformer inference for each query-document pair. This makes them much slower than vector search, which only needs to compute a single query vector and compare it with pre-stored document vectors. For real-time systems with high user traffic, this added latency can be a bottleneck. 

Additionally, the computational cost of reranking increases with the number of documents being reranked. As a result, rerankers are typically used only after an initial retrieval step has reduced the candidate set, balancing the need for accuracy with the need for performance.
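A back-of-the-envelope calculation shows why the candidate set is kept small. The figures below (40 ms per batch, 25 pairs per batch) are hypothetical and chosen only to show the scaling; real latency depends on the model, the hardware, and the document lengths.

```python
import math

def rerank_latency_ms(num_candidates, ms_per_batch, batch_size):
    """Estimate reranking latency, assuming one forward pass per batch of query-document pairs."""
    num_batches = math.ceil(num_candidates / batch_size)
    return num_batches * ms_per_batch

# Hypothetical figures: cost grows linearly with the number of candidates.
for candidates in (25, 100, 1000):
    print(candidates, rerank_latency_ms(candidates, ms_per_batch=40, batch_size=25))
# prints:
# 25 40
# 100 160
# 1000 1600
```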

### How Rerankers Are Used In A RAG Pipeline

In a typical RAG pipeline, rerankers are used as part of a two-stage retrieval system. The first stage involves the fast retrieval of documents using vector or keyword search. This stage is designed for speed because we want to narrow down millions of documents to just a handful as quickly as possible. 

Once the vector search has pulled the top documents (say, the top 25), the reranker steps in. The reranker takes this smaller set of documents and reorders them based on how well they match the user’s query, using its deeper understanding of the content. This ensures that the top results shown to the user are not just “close enough” but are actually the most relevant documents available.

This combination of vector search for speed and reranking for accuracy strikes a balance between performance and relevance. By using vector search as a first pass to trim down the number of documents and then applying a reranker to refine the ordering, the RAG pipeline can pass better, more precise context to the LLM, improving the quality of the final output.
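The two stages compose naturally as a function that takes the retriever and the scorer as arguments. The sketch below is an assumption-laden toy: `two_stage_search`, `toy_retrieve`, and `toy_score_pair` are hypothetical names, and both stages are replaced by word-overlap stand-ins so the example runs on its own.

```python
def two_stage_search(query, retrieve, score_pair, first_k=25, top_n=5):
    """Stage 1: cheap retrieval of first_k candidates. Stage 2: rescore them and keep top_n."""
    candidates = retrieve(query, first_k)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

corpus = [
    "sorting algorithms and complexity",
    "data structures such as trees and queues",
    "baking bread at home",
    "algorithms and data structures are core computer science subjects",
]

def toy_retrieve(query, k):
    """Stand-in first stage: keep documents sharing at least one word with the query."""
    words = set(query.lower().split())
    hits = [doc for doc in corpus if words & set(doc.lower().split())]
    return hits[:k]

def toy_score_pair(query, doc):
    """Stand-in second stage: count shared words (a real system would call a cross-encoder)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

results = two_stage_search(
    "fundamental algorithms and data structures in computer science",
    toy_retrieve, toy_score_pair, first_k=3, top_n=2,
)
print(results)
```

The first stage returns candidates in whatever order it finds them; only the second stage decides which of them end up at the top.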

### Implementation

We are going to use Cohere’s hosted reranker (the `rerank-english-v3.0` model) for this task, applied to chunks retrieved from a Weaviate collection named `Article`.

In [7]:
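import os

import weaviate
from dotenv import load_dotenv

# Load API keys (OPENAI_API_KEY here, COHERE_API_KEY later) from the .env file.
load_dotenv("./../.env")

# Start an embedded Weaviate instance; the OpenAI key is passed through
# for Weaviate's OpenAI integration.
client = weaviate.connect_to_embedded(
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

# Handle to the existing "Article" collection queried below.
article = client.collections.get("Article")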


In [8]:
import textwrap

def print_objects(objects):
    """Print the ID, metadata, and main properties of each retrieved object."""
    for obj in objects:
        print(f"ID: {obj.uuid.int}")
        # Keep only the metadata fields that were returned, rounding floats for readability.
        metadata = [{k: round(v, 2) if isinstance(v, float) else v} for k, v in obj.metadata.__dict__.items() if v is not None]
        print(f"Metadata: {metadata}")
        print(f"Title: {obj.properties['title']}")
        print(f"Date: {obj.properties['date']}")
        print(f"Category: {obj.properties['category']}")
        print(f"Author: {obj.properties['author']}")
        print(f"Body: {textwrap.shorten(obj.properties['body'], width=100)}")
        print()

In [9]:
from weaviate.classes.query import MetadataQuery

query = "what are the fundamental subjects in computer science?"

# First stage: fetch the 10 chunks nearest to the query, along with their distance and certainty.
chunks = article.query.near_text(
    query=query,
    limit=10,
    return_metadata=MetadataQuery(distance=True, certainty=True)
)

print_objects(chunks.objects)

ID: 253766494100517455433948109859044355170
Metadata: [{'distance': 0.63}, {'certainty': 0.69}]
Title: Algorithms
Date: 2021-12-05 00:00:00+00:00
Category: Programming
Author: Towards Data Science
Body: Algorithms are step-by-step instructions or rules designed to perform a task or solve a [...]

ID: 208507677698874478258804158284056307633
Metadata: [{'distance': 0.64}, {'certainty': 0.68}]
Title: Data Structures
Date: 2022-03-12 00:00:00+00:00
Category: Programming
Author: GeeksForGeeks
Body: Data structures are ways of organizing and storing data so that they can be accessed and [...]

ID: 26347448900503708385528402502775929930
Metadata: [{'distance': 0.65}, {'certainty': 0.68}]
Title: Web Development Fundamentals
Date: 2023-02-18 00:00:00+00:00
Category: Web Development
Author: MDN Web Docs
Body: Web development encompasses the building and maintenance of websites. It includes aspects such [...]

ID: 111770961923860747703082376322890795880
Metadata: [{'distance': 0.67}, {'certainty'

In [12]:
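import cohere

# Cohere client; expects COHERE_API_KEY to be set (for example via the .env file loaded earlier).
co = cohere.Client(api_key=os.getenv("COHERE_API_KEY"))

# The reranker scores plain text, so pass only the body of each retrieved chunk.
chunks_text = [chunk.properties["body"] for chunk in chunks.objects]

# Second stage: rescore the retrieved chunks against the query and keep the best 5.
reranked_chunks = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=chunks_text,
    top_n=5,
    return_documents=True
)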

# Print each result; a plain loop avoids building a throwaway list of None values.
for chunk in reranked_chunks.results:
    print(chunk)

document=RerankResponseResultsItemDocument(text='Algorithms are step-by-step instructions or rules designed to perform a task or solve a problem. They are the backbone of computer science, ranging from simple tasks like sorting numbers to more complex operations like machine learning. Efficient algorithms help in saving computational resources.') index=0 relevance_score=0.049958523
document=RerankResponseResultsItemDocument(text='Data structures are ways of organizing and storing data so that they can be accessed and worked with efficiently. Common types include arrays, linked lists, stacks, queues, and trees. Understanding data structures is crucial in optimizing algorithms and improving the performance of programs.') index=1 relevance_score=0.009559399
document=RerankResponseResultsItemDocument(text="Object-oriented programming (OOP) is a paradigm based on the concept of 'objects', which can contain data and code. The four key principles of OOP are encapsulation, abstraction, inherit
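Each rerank result carries an `index` pointing back into the list of documents that was submitted, plus a `relevance_score`, as the output above shows. Because only the `body` text was sent to the reranker, that index is what lets you recover the title, author, and other properties of the original object. The helper below is a sketch: `attach_scores` and its `min_score` parameter are hypothetical, and the `SimpleNamespace` stand-ins (with scores rounded from the output above) exist only so the example runs without API keys.

```python
from types import SimpleNamespace

def attach_scores(objects, rerank_results, min_score=0.0):
    """Pair each reranked result with its original object via result.index."""
    ranked = []
    for result in rerank_results:
        if result.relevance_score >= min_score:
            ranked.append((objects[result.index], result.relevance_score))
    return ranked

# Stand-ins shaped like the retrieved objects and rerank results used in this notebook.
fake_objects = [
    SimpleNamespace(properties={"title": "Algorithms"}),
    SimpleNamespace(properties={"title": "Data Structures"}),
    SimpleNamespace(properties={"title": "Web Development Fundamentals"}),
]
fake_results = [
    SimpleNamespace(index=0, relevance_score=0.05),
    SimpleNamespace(index=1, relevance_score=0.01),
]

ranked = attach_scores(fake_objects, fake_results)
for obj, score in ranked:
    print(f"{score:.2f}  {obj.properties['title']}")
```

In the notebook itself, the equivalent call would be `attach_scores(chunks.objects, reranked_chunks.results)`.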