# Re-Ranking

In RAG (Retrieval-Augmented Generation), re-ranking refines the initial set of retrieved documents according to their relevance to the input query. The retrieved documents are re-scored with a slower but more accurate model, such as a cross-encoder, which reads the query and a document together and outputs a single relevance score for the pair. The re-ranked list is then passed to the generation model, so that the most relevant passages are the ones used to produce the final output.

![Cross Encoder Image](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png)


https://www.sbert.net/examples/applications/retrieve_rerank/README.html
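
The re-scoring step described above can be sketched independently of any particular model. The snippet below is a minimal illustration, not the notebook's implementation: `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a cross-encoder so that the example runs without downloading a model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    documents: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every (query, document) pair and return the top_k by score."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    # Sort by score only, highest first; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, document: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Made-up candidate passages, for illustration only.
candidates = [
    "Llama 2 is a family of language models.",
    "Mixtral uses a context size of 32k tokens.",
    "Mixtral is a sparse mixture-of-experts model.",
]
top = rerank("context size of Mixtral", candidates, toy_overlap_score, top_k=2)
for score, doc in top:
    print(f"{score:.2f}  {doc}")
```

In the rest of this notebook, the role of `score_fn` is played by the cross-encoder, which scores all pairs in one batched call instead of one pair at a time.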

In [1]:
from rich.console import Console
from rich_theme import custom_theme

# Create a console with the dark theme
console = Console(theme=custom_theme)


In [2]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

In [3]:
from sentence_transformers import CrossEncoder 
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


## Loading retrieval results

We load the retrieval results from the previous Hybrid-Search notebook to avoid repetition.

In [4]:
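import json

# Load the documents saved by the Hybrid-Search notebook; the file is closed automatically
with open('data/retrieved_docs.json') as f:
    retrieved_documents = json.load(f)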

In [17]:
console.print(retrieved_documents)

In [20]:
# This is the query that we used for the retrieval of the above documents
query = "What is context size of Mixtral?"


## Calculating the re-ranking scores

In [19]:
# Pair the query with each retrieved document; the cross-encoder scores each pair jointly
pairs = [[query, doc['text']] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)

console.print(scores)
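
Depending on the model's configuration, the values returned by `predict` may be unbounded logits rather than probabilities. If scores in the range 0 to 1 are easier to read or to threshold, a sigmoid can be applied afterwards. The sketch below assumes raw logit scores; the numbers in `raw_scores` are made up for illustration and are not the output of this notebook.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Hypothetical raw scores, made up for illustration (not the notebook's output).
raw_scores: List[float] = [4.2, -1.3, 0.0, -8.5]
normalized = [sigmoid(s) for s in raw_scores]

# Sigmoid is monotonic, so the ranking is the same before and after.
order_raw = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i], reverse=True)
order_norm = sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
print(order_raw == order_norm)
print([round(p, 3) for p in normalized])
```

Because the transformation preserves order, it changes how the scores read but not which documents are selected.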

## Selecting top 3 reranked documents

In [16]:
import numpy as np

print("Top 3 Documents:")
# argsort returns indices in ascending score order; reverse it and keep the 3 best
for idx in np.argsort(scores)[::-1][:3]:
    print(retrieved_documents[idx]['text'])

Top 3 Documents:
experts”) to process the token and combine their output additively. This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token. Mixtral is pretrained with multilingual data using a context size of 32k tokens. It either matches or exceeds the performance of Llama 2 70B and GPT-3.5, over several benchmarks. In particular, Mixture of Experts Layer i gating inputs af outputs router expert

This chunk describes the key architectural details of the Mixtral model, a sparse mixture-of-experts language model that outperforms larger models like Llama 2 70B and GPT-3.5 on various benchmarks.
[Plot text extracted from the source PDF: Mixtral_8x7B, Passkey Performance, axes "Seq Len" and "Context length" (0 to 30k)] Figure 4: Long range p