In [1]:
!uv pip install numpy==1.26.4


SBERT/bi-encoder, ColBERT, and cross-encoder/re-ranker are three strategies used for information retrieval in RAG systems.



### BERT (Bidirectional Encoder Representations from Transformers)

In [2]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example input text
text = "hello world"

# Tokenize input text
inputs = tokenizer(text, return_tensors='pt')



![title](image/bert.png)

In [3]:
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
print(tokens)


['[CLS]', 'hello', 'world', '[SEP]']


Every input begins with a special [CLS] token (short for “classification”) that marks the start of the text. Its output embedding draws on the entire input and is commonly used as a summary representation for downstream tasks such as classification.

BERT can handle either a single sentence or a pair of sentences. A [SEP] token closes each sentence, so when two sentences are involved it also marks the boundary between them.
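As a rough illustration of this layout, the sketch below assembles the token sequence by hand. The helper `build_bert_input` is hypothetical and works on words that are already split; the real tokenizer also performs WordPiece splitting and maps tokens to IDs. The segment IDs are meant to mirror what BERT tokenizers usually return as `token_type_ids`.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Lay out one or two token lists the way BERT expects them.

    Hypothetical helper for illustration only; it does no WordPiece
    splitting and no vocabulary lookup.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)       # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        second = list(tokens_b) + ["[SEP]"]
        tokens += second
        segment_ids += [1] * len(second)  # second segment, including the final [SEP]
    return tokens, segment_ids


single_tokens, single_segments = build_bert_input(["hello", "world"])
pair_tokens, pair_segments = build_bert_input(["how", "are", "you"], ["fine", "thanks"])

print(single_tokens)
print(pair_tokens)
print(pair_segments)
```

With the actual tokenizer, passing two strings, as in `tokenizer(text_a, text_b)`, should produce the same kind of paired layout.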

In summary, the BERT encoder produces one embedding vector per input token, including the special [CLS] and [SEP] tokens.

In [4]:
# Pass inputs through BERT model
with torch.no_grad():
    outputs = model(**inputs)

# Get embedding tensor
embeddings = outputs.last_hidden_state
print(embeddings.shape) # [batch, number of tokens, embedding dimension]

torch.Size([1, 4, 768])


This means the batch contains a single input with four token embeddings ([CLS], hello, world, [SEP]), each of dimension 768.

### SBERT = Sentence BERT

![title](image/sbert.png)

While the original BERT model focuses on token-level embeddings, SBERT builds upon BERT to generate meaningful sentence-level embeddings.

The key idea behind SBERT is simple yet powerful. It averages the embeddings of all the tokens in a sentence, a technique called mean pooling, to produce a single vector that captures the sentence’s overall meaning. The underlying BERT model is fine-tuned so that these pooled vectors can be compared meaningfully.

Once each text has been reduced to a single vector, we can compare two texts using cosine similarity. Cosine similarity yields a value between -1 and +1; the closer it is to +1, the more similar the two vectors are.
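The two steps can be sketched in plain Python on made-up numbers. The helpers `mean_pool` and `cosine_similarity` are illustrative names, not library functions, and the attention mask is an assumption that reflects how padded batches are usually handled: padding positions are left out of the average.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "token embeddings" (made-up numbers, not model output)
sentence_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]  # last row is padding
mask_a = [1, 1, 0]
sentence_b = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
mask_b = [1, 1]

vec_a = mean_pool(sentence_a, mask_a)
vec_b = mean_pool(sentence_b, mask_b)
similarity = cosine_similarity(vec_a, vec_b)
print(vec_a, vec_b, round(similarity, 4))
```

Both pooled vectors point in the same direction here, so the similarity comes out at (approximately) the maximum value even though the vectors differ in length.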



In [5]:
from sentence_transformers import SentenceTransformer

# Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# The text to encode
query = "How is the weather today?"
docs = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "I like cats.",
]

# Calculate vector embeddings by calling model.encode()
query_embedding = model.encode(query)
docs_embeddings = model.encode(docs) # we could store these in a database


# Calculate the cosine similarities
cos_similarities = model.similarity(query_embedding, docs_embeddings)
print(cos_similarities)

tensor([[0.7551, 0.5361, 0.0707]])


The first document has the highest score and is therefore the best match. The third document, with a score close to zero, is essentially unrelated to the query.

In a RAG system, all document embeddings are computed ahead of time and stored in a vector database. At runtime, we convert only the query text into an embedding vector and search for its nearest neighbors in the vector database. This lookup is very fast and can be sped up further, for example with approximate nearest-neighbor indexes.
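A minimal sketch of that lookup, assuming a brute-force scan over an in-memory list instead of a real vector database. The `top_k` helper and the three-dimensional vectors are made up for illustration; a production system would delegate this step to the database's own search API.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, doc_vecs, k):
    """Return (doc_index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Stand-in for the stored document embeddings (made-up numbers)
index = [
    [0.9, 0.1, 0.0],
    [0.7, 0.6, 0.1],
    [0.0, 0.1, 0.9],
]
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded query

hits = top_k(query_vec, index, k=2)
print(hits)
```

The scan touches every stored vector, which is fine for a handful of documents but is exactly the cost that dedicated indexes are designed to avoid.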

However, SBERT’s conversion of entire documents (or smaller document chunks) into a single vector leads to information loss and, consequently, a loss of retrieval accuracy. Additionally, with this method, queries and documents are embedded separately. The model never actually “sees” the query and document text together.

### Cross-Encoder

![title](image/cross-encoder.png)

Unlike SBERT, which treats the query and document independently, a cross-encoder (commonly used as a re-ranker) processes them together as one input and predicts a relevance score y that reflects how well the document answers the query. The higher the score, the better the match. The [CLS] embedding can be used as input for a small neural network that is trained to produce relevance scores.

In [6]:
from sentence_transformers import CrossEncoder

# Load a pretrained re-ranker model
cross_encoder = CrossEncoder(
    "cross-encoder/ms-marco-TinyBERT-L-2-v2", max_length=512, device="cpu"
)

# The text to rank
query = "How is the weather today?"
docs = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "I like cats.",
]

# Calculate relevance scores by calling cross_encoder.rank()
scores = cross_encoder.rank(
    query=query,
    documents=docs,
    return_documents=True,
)
print(scores)

[{'corpus_id': 0, 'score': 8.2618475, 'text': 'The weather is lovely today.'}, {'corpus_id': 1, 'score': -7.8940616, 'text': "It's so sunny outside!"}, {'corpus_id': 2, 'score': -11.505094, 'text': 'I like cats.'}]


The higher the score, the better the match between query and document. Note that these scores are unbounded and can be negative.
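If a value between 0 and 1 is more convenient than the raw scores printed above, one common option is to pass them through a logistic sigmoid. This is an assumption about how the scores might be post-processed, not something the notebook or the model prescribes. Because the sigmoid is monotonic, the ranking stays the same.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Raw scores copied from the output above
raw_scores = [8.2618475, -7.8940616, -11.505094]
probabilities = [sigmoid(s) for s in raw_scores]

for raw, p in zip(raw_scores, probabilities):
    print(f"{raw:10.4f} -> {p:.6f}")
```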

Unlike SBERT, a re-ranker model sees the query and document text together, allowing it to judge relevance much more accurately. However, it has to run the model once for every query-document pair, so re-ranking is much slower and is usually only applied to the top K results after the initial retrieval stage. For instance, it might re-rank the top 20 documents to find the five most relevant ones.

### ColBERT = Contextualized Late Interaction over BERT

Unlike SBERT, ColBERT does not average BERT’s output embeddings into a single vector. Instead, it works with the token-level embeddings from BERT.

To calculate the relevance score between a query and a document using ColBERT, we build a matrix with one row per query token and one column per document token. Each cell contains the cosine similarity between the corresponding query token embedding $q$ and document token embedding $d$.

![title](image/colbert.png)

For each row of that matrix (that is, for each query token) we select the maximum value. Then, we sum all the selected values. This MaxSim approach yields a relevance score between a query and a single document. The process must be repeated for every candidate document.
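The scoring rule can be sketched in a few lines of plain Python. The function names `cosine` and `maxsim_score` and the two-dimensional token embeddings are made up for illustration; this is not the ColBERT library's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best cosine match among the document tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (made-up numbers, not model output)
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_b_tokens = [[1.0, 0.0], [-1.0, 0.0]]

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
print(score_a, score_b)
```

Document A contains a close match for both query tokens, while document B only matches the first one, so A receives the higher score.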

In terms of retrieval accuracy and speed, ColBERT falls between the bi-encoder and the cross-encoder. One downside of ColBERT is that it must store one embedding vector per token for each document rather than a single vector. This significantly increases the amount of storage needed in a vector database.
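A back-of-the-envelope calculation makes the storage gap concrete. Every number below is an assumption chosen for illustration (embedding dimension, tokens per chunk, corpus size, uncompressed float32 values). Real ColBERT deployments typically use a smaller per-token dimension and compression, so the actual gap is usually smaller than this sketch suggests.

```python
def embedding_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of num_vectors float32 vectors of the given dimension."""
    return num_vectors * dim * bytes_per_value


DIM = 768               # assumed embedding dimension for both approaches
TOKENS_PER_CHUNK = 200  # assumed average number of tokens per document chunk
NUM_CHUNKS = 100_000    # assumed corpus size

single_vector_total = embedding_bytes(1, DIM) * NUM_CHUNKS
multi_vector_total = embedding_bytes(TOKENS_PER_CHUNK, DIM) * NUM_CHUNKS
ratio = multi_vector_total // single_vector_total

print(f"single-vector index: {single_vector_total / 1e9:.2f} GB")
print(f"multi-vector index:  {multi_vector_total / 1e9:.2f} GB")
print(f"ratio: {ratio}x")
```

With equal dimensions, the ratio is simply the average number of tokens per chunk.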

ColBERT can be used for both first-stage retrieval and as an alternative to a re-ranker for narrowing down retrieved candidates.

ColBERT is supported by a few vector databases, such as Qdrant and Weaviate. There is also a Stanford GitHub repository: `stanford-futuredata/ColBERT`.

### Summary

The table below summarizes the differences between the three strategies. Depending on your use case and requirements, you can use one of them or a combination.

![title](image/summary.png)

A bi-encoder is the fastest option, but it has the lowest retrieval accuracy. ColBERT is slower and requires more vector storage, but its accuracy is better than that of the bi-encoder. Both methods can be used to retrieve documents based on a query.

A re-ranker is typically used in the second stage of retrieval to narrow down the list of candidates. Re-rankers are slow but produce the best results.

To achieve optimal results in terms of speed, accuracy, and storage, I recommend using a combination of a bi-encoder and a re-ranker. First, the bi-encoder retrieves an initial set of candidates. Then, the re-ranker narrows down the list to the top results.
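The two-stage pattern might look like the sketch below. The function `retrieve_then_rerank` is a hypothetical helper, and the two word-overlap scorers are toy stand-ins so that the example runs without downloading any model. In practice, the first scorer would wrap the bi-encoder similarity and the second would wrap the cross-encoder.

```python
import re


def retrieve_then_rerank(query, docs, first_stage_score, rerank_score, top_k=20, top_n=5):
    """Keep the top_k docs by the cheap score, then return the top_n by the expensive score."""
    candidates = sorted(docs, key=lambda d: first_stage_score(query, d), reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]


def words(text):
    return set(re.findall(r"[a-z']+", text.lower()))


def cheap_score(query, doc):
    """Toy stand-in for the bi-encoder: number of shared words."""
    return len(words(query) & words(doc))


def careful_score(query, doc):
    """Toy stand-in for the re-ranker: Jaccard overlap of the word sets."""
    q, d = words(query), words(doc)
    return len(q & d) / len(q | d)


docs = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "I like cats.",
    "Weather report: rain today and tomorrow.",
]
query = "How is the weather today?"

results = retrieve_then_rerank(query, docs, cheap_score, careful_score, top_k=2, top_n=1)
print(results)
```

Only the `top_k` candidates ever reach the expensive scorer, which is what keeps the overall latency manageable.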



### References:
1. https://ai.gopubby.com/three-different-retrieval-strategies-in-rag-systems-e9434fd80f35