# Retrieval prototype

Task: Given a question, find the **best Wikipedia article** that answers it.

For this task, we will try semantic search with the following models:

- sentence-transformers/all-mpnet-base-v2 (109M): According to the [SBERT documentation](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#semantic-search-models), the all-mpnet-base-v2 model provides the best quality. At 109M parameters it is relatively small, so we use it as the baseline.
- sentence-transformers/gtr-t5-xxl (4.86B): A much larger T5-based dense retrieval model, in the same dense-retrieval family as [Meta DPR](https://github.com/facebookresearch/DPR). At 4.86B parameters it is quite large: creating the index takes almost 2 hours on an A100.
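
As a rough illustration of what "semantic search" means here: each chunk and each question is mapped to a vector, and retrieval returns the chunks whose vectors are closest to the question vector. The sketch below uses tiny hand-written 2-D vectors instead of real embeddings and assumes squared L2 distance as the metric; the function names are hypothetical and not part of `utils`.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_nearest(query, doc_vectors, k):
    """Return the k (distance, doc_index) pairs closest to the query."""
    scored = sorted(
        (squared_l2(query, vec), idx) for idx, vec in enumerate(doc_vectors)
    )
    return scored[:k]


# Toy "embeddings": three document chunks and one query (made-up values).
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]

nearest = top_k_nearest(query, doc_vectors, k=2)
print([idx for _, idx in nearest])
```

In the notebook itself this brute-force loop is replaced by a FAISS index, which performs the same kind of nearest-neighbor lookup far more efficiently.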

Metrics:
- Top-1 accuracy
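
Note that the evaluation cells later in this notebook do not count hits uniformly: each question carries a `points` value, and the reported percentage is the points earned divided by the maximum achievable points. A minimal sketch of that points-weighted top-1 accuracy, using made-up titles and point values and a hypothetical function name:

```python
def weighted_top1_accuracy(predicted_titles, gold_titles, points):
    """Share of total points earned by exact top-1 title matches."""
    earned = sum(
        p
        for pred, gold, p in zip(predicted_titles, gold_titles, points)
        if pred == gold
    )
    total = sum(points)
    return earned / total if total else 0.0


# Toy data (illustrative only): two of three predictions are correct.
predicted = ["Bucharest", "Weather", "April"]
gold = ["Bucharest", "Environment", "April"]
points = [52, 10, 38]

score = weighted_top1_accuracy(predicted, gold, points)
print(f"{score:.2%}")
```

Because of the weighting, missing a single high-point question can cost more than missing several low-point ones.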


Resources:
https://medium.com/@nadikapoudel16/advanced-rag-implementation-using-hybrid-search-reranking-with-zephyr-alpha-llm-4340b55fef22

In [2]:
import os 
from datasets import load_dataset
from langchain.docstore.document import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
import json
import numpy as np
from tqdm.notebook import tqdm
from FlagEmbedding import FlagReranker
from utils import get_vector_db, get_query_embeddings
from logger import logger

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

In [3]:
wiki_dataset = load_dataset("wikipedia", "20220301.simple")
if not os.path.exists("wiki_data.jsonl"):
    wiki_dataset["train"].to_json("wiki_data.jsonl")  

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [4]:
documents = []
for row in wiki_dataset["train"].select(range(10000)):
    doc = Document(
        page_content=row["text"], 
        metadata={
            "id": row["id"],
            "url": row["url"],
            "title": row["title"],
        }
    )
    documents.append(doc)
logger.info(documents[0])

2025-01-31 19:01:32,590 - INFO - page_content='April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days.

April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.

April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.

The Month 

April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.

April begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.

In common years, April starts on the same day of the week 

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(documents)

logger.info(f'Total number of splits: {len(all_splits)}')
logger.info(f'{all_splits[0]}')

2025-01-31 19:01:33,561 - INFO - Total number of splits: 45571
2025-01-31 19:01:33,562 - INFO - page_content='April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days.

April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.

April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.

The Month 

April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.

April begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245

In [6]:
# Load train dataset
train_dataset = []
with open('train.jsonl', 'r', encoding='utf-8') as file:
    for line in file:
        train_dataset.append(json.loads(line.strip()))

In [7]:
# Maximum achievable score: the sum of points over all questions
max_score = sum(item['points'] for item in train_dataset)
logger.info(f'Max score: {max_score}')

2025-01-31 19:01:33,622 - INFO - Max score: 1356381


## sentence-transformers/all-mpnet-base-v2

In [8]:
model = "all_mpnet_base_v2"
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs = {'device':'cuda'},
    encode_kwargs = {'normalize_embeddings': False})

db = get_vector_db(model=model, embeddings=embeddings)



Index found, loading...


In [9]:
print("Processing queries...")
query_embeddings = np.array(get_query_embeddings(train_dataset, db, embedding_path=f"query_embeddings_{model}.pkl"), dtype=np.float32)
D, I = db.index.search(query_embeddings, k=5)  # k = number of nearest neighbors

print("Processing results...")
total_score = 0
results = [] 

for idx, item in enumerate(train_dataset):
    query = item['question']
    gold_title = item['article']
    points = item['points']
    
    # Store top 5 retrieved documents
    retrieved_docs = []
    for rank in range(5): 
        doc_index = I[idx][rank]
        distance = D[idx][rank]
        
        doc = db.docstore._dict[db.index_to_docstore_id[doc_index]]
        doc_content = doc.page_content
        doc_metadata = doc.metadata
        retrieved_title = doc_metadata['title']

        retrieved_docs.append({
            "rank": rank + 1,
            "title": retrieved_title,
            "content": doc_content,
            "distance": float(distance)  
        })

    # Check if the highest-ranked document matches the gold title
    if retrieved_docs[0]["title"] == gold_title:
        total_score += item['points']

    results.append({
        "query": query,
        "gold_article": gold_title,
        "points": points, 
        "retrieved_docs": retrieved_docs
    })

# Save results to JSON
output_data = {
    "model": model,
    "total_score": total_score,
    "accuracy_percentage": round(total_score / max_score * 100, 2),
    "results": results
}

with open(f"retrieved_results_{model}.json", "w", encoding="utf-8") as f:
    json.dump(output_data, f, ensure_ascii=False, indent=4)

print(f"{model} score: {total_score} ({output_data['accuracy_percentage']}%)")
print(f"Saved top 5 retrieved documents to retrieved_results_{model}.json")


Processing queries...
Processing results...
all_mpnet_base_v2 score: 652667 (48.12%)
Saved top 5 retrieved documents to retrieved_results_all_mpnet_base_v2.json


Hypothesis: a more powerful embedding model may capture the semantics of document chunks better, resulting in better retrieval.

## sentence-transformers/gtr-t5-xxl

In [8]:
# Load embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/gtr-t5-xxl",
    model_kwargs = {'device':'cuda'},
    encode_kwargs = {'normalize_embeddings': False})

2025-01-31 19:01:43,499 - INFO - Load pretrained SentenceTransformer: sentence-transformers/gtr-t5-xxl




In [9]:
db = get_vector_db(model_name = "gtr_t5_xxl", embeddings=embeddings)

2025-01-31 19:01:54,952 - INFO - Loading faiss with AVX2 support.
2025-01-31 19:01:54,979 - INFO - Successfully loaded faiss with AVX2 support.


Index found, loading...


In [11]:
question = train_dataset[0]['question']
answer = train_dataset[0]['article']

retrieved_docs = db.similarity_search_with_score(question, k=15)

print(f"Question:{question}\n=========")
print(f"Gold article:{answer}\n=========")
for retr_doc in retrieved_docs:
    print(f"Title: {retr_doc[0].metadata['title']}\nContent: {retr_doc[0].page_content}\nScore:{retr_doc[1]}")
    print("=========")

Question:how do living organisms in a natural environment respond to changes in weather or climate?
Gold article:Environment
Title: Global warming
Content: As the Earth's surface temperature becomes hotter the sea level rises. This is partly because water over  expands when it gets warmer. It is also partly because warm temperatures make glaciers and ice caps melt. The sea level rise causes coastal areas to flood. Weather patterns, including where and how much rain or snow there is, are changing. Deserts will probably increase in size. Colder areas will warm up faster than warm areas. Strong storms may become more likely and farming may not make as much food. These effects will not be the same everywhere. The changes from one area to another are not well known.

Governments have agreed to keep temperature rise below , but current plans by governments are not enough to limit global warming that much.
Score:0.6273342967033386
Title: Weather
Content: Weather is the day-to-day or hour-to-h

In [10]:
print("Processing queries...")
model = "gtr_t5_xxl"
query_embeddings = np.array(get_query_embeddings(train_dataset, db, embedding_path=f"query_embeddings_{model}.pkl"), dtype=np.float32)
D, I = db.index.search(query_embeddings, k=5)  # k = number of nearest neighbors

print("Processing results...")
total_score = 0
results = [] 

for idx, item in enumerate(train_dataset):
    query = item['question']
    gold_title = item['article']
    points = item['points']
    
    # Store top 5 retrieved documents
    retrieved_docs = []
    for rank in range(5): 
        doc_index = I[idx][rank]
        distance = D[idx][rank]
        
        doc = db.docstore._dict[db.index_to_docstore_id[doc_index]]
        doc_content = doc.page_content
        doc_metadata = doc.metadata
        retrieved_title = doc_metadata['title']

        retrieved_docs.append({
            "rank": rank + 1,
            "title": retrieved_title,
            "content": doc_content,
            "distance": float(distance)  
        })

    # Check if the highest-ranked document matches the gold title
    if retrieved_docs[0]["title"] == gold_title:
        total_score += item['points']

    results.append({
        "query": query,
        "gold_article": gold_title,
        "points": points, 
        "retrieved_docs": retrieved_docs
    })

output_data = {
    "model": model,
    "total_score": total_score,
    "accuracy_percentage": round(total_score / max_score * 100, 2),
    "results": results
}

with open(f"retrieved_results_{model}.json", "w", encoding="utf-8") as f:
    json.dump(output_data, f, ensure_ascii=False, indent=4)

print(f"{model} score: {total_score} ({output_data['accuracy_percentage']}%)")
print(f"Saved top 5 retrieved documents to retrieved_results_{model}.json")


Processing queries...
Processing results...
gtr_t5_xxl score: 712993 (52.57%)
Saved top 5 retrieved documents to retrieved_results_gtr_t5_xxl.json


After visual inspection:
- Usually, the gold article is somewhere in the top 5 retrieved documents.
- There seems to be some noise in the dataset: sometimes more than one article can answer the question. For example:

```text
"query": "what is the name of the largest city in romania?",
"gold_article": "Bucharest",
"points": 52,
"retrieved_docs": [
    {
        "rank": 1,
        "title": "Romania",
        "content": "Religion\nRomania is a secular state. This means Romania has no national religion. The biggest religious group in Romania is the Romanian Orthodox Church. It is an autocephalous church inside of the Eastern Orthodox communion. In 2002, this religion made up 86.7% of the population. Other religions in Romania include Roman Catholicism (4.7%), Protestantism (3.7%), Pentecostalism (1.5%) and the Romanian Greek-Catholicism (0.9%).\n\nCities\n\nBucharest is the capital of Romania. It also is the biggest city in Romania, with a population of over 2 millions peoples.\n\nThere are 5 other cities in Romania that have a population of more than 300,000 people. These are Iaşi, Cluj-Napoca, Timişoara, Constanţa, and Craiova. Romania also has 5 cities that have more than 200,000 people living in them: Galaţi, Braşov, Ploieşti, Brăila, and Oradea.\n\nThirteen other cities in Romania have a population of more than 100,000 people.\n\nEconomy",
        "distance": 0.49112021923065186
    },
```

**"...Cities: Bucharest is the capital of Romania. It also is the biggest city in Romania..."**

To better measure utility, it may be better to measure top-5 accuracy along with some LLM-as-a-judge to catch edge cases. However, for the sake of this task, we will stick to top-1 accuracy. 
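
A top-k variant of the metric can be computed directly from the saved results. The sketch below assumes each entry has the same shape as the `results` items written to the JSON files above (`gold_article`, `points`, and `retrieved_docs` with a `title` per document); the function name and the toy entries are illustrative only.

```python
def weighted_top_k_accuracy(results, k):
    """Share of total points where the gold article appears in the top k."""
    earned = 0
    total = 0
    for item in results:
        total += item["points"]
        top_titles = [doc["title"] for doc in item["retrieved_docs"][:k]]
        if item["gold_article"] in top_titles:
            earned += item["points"]
    return earned / total if total else 0.0


# Toy results (made-up): the gold article is at rank 2 for the first query
# and missing for the second.
toy_results = [
    {
        "gold_article": "Bucharest",
        "points": 52,
        "retrieved_docs": [{"title": "Romania"}, {"title": "Bucharest"}],
    },
    {
        "gold_article": "Environment",
        "points": 48,
        "retrieved_docs": [{"title": "Global warming"}, {"title": "Weather"}],
    },
]

print(weighted_top_k_accuracy(toy_results, k=1))
print(weighted_top_k_accuracy(toy_results, k=2))
```

Comparing k=1 with k=5 on the real results would quantify how often the gold article is retrieved but not ranked first.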

Since we commonly see that the "best" doc is in top 5 but not the top 1, using a **reranker** may help improve performance.

## Using a smaller but stronger embedder for speed

Based on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), we try BAAI/bge-large-en-v1.5.

In [17]:
model_name = "BAAI/bge-large-en-v1.5"
print(model_name)
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs = {'device':'cuda'},
    encode_kwargs = {'normalize_embeddings': False})

model_name = model_name.split("/")[1]
model_name = model_name.replace("-", "_")

2025-01-31 17:56:45,512 - INFO - Load pretrained SentenceTransformer: BAAI/bge-large-en-v1.5


BAAI/bge-large-en-v1.5


In [18]:
db = get_vector_db(model_name, embeddings, all_splits)

Index found, loading...


In [19]:
print("Processing queries...")
query_embeddings = np.array(get_query_embeddings(train_dataset, db, embedding_path=f"query_embeddings_{model_name}.pkl"), dtype=np.float32)
D, I = db.index.search(query_embeddings, k=5)  # k = number of nearest neighbors

print("Processing results...")
total_score = 0
results = [] 

for idx, item in enumerate(train_dataset):
    query = item['question']
    gold_title = item['article']
    points = item['points']
    
    # Store top 5 retrieved documents
    retrieved_docs = []
    for rank in range(5): 
        doc_index = I[idx][rank]
        distance = D[idx][rank]
        
        doc = db.docstore._dict[db.index_to_docstore_id[doc_index]]
        doc_content = doc.page_content
        doc_metadata = doc.metadata
        retrieved_title = doc_metadata['title']

        retrieved_docs.append({
            "rank": rank + 1,
            "title": retrieved_title,
            "content": doc_content,
            "distance": float(distance)  
        })

    # Check if the highest-ranked document matches the gold title
    if retrieved_docs[0]["title"] == gold_title:
        total_score += item['points']

    results.append({
        "query": query,
        "gold_article": gold_title,
        "points": points, 
        "retrieved_docs": retrieved_docs
    })

output_data = {
    "model_name": model_name,
    "total_score": total_score,
    "accuracy_percentage": round(total_score / max_score * 100, 2),
    "results": results
}

with open(f"retrieved_results_{model_name}.json", "w", encoding="utf-8") as f:
    json.dump(output_data, f, ensure_ascii=False, indent=4)

print(f"{model_name} score: {total_score} ({output_data['accuracy_percentage']}%)")
print(f"Saved top 5 retrieved documents to retrieved_results_{model_name}.json")


Processing queries...
Processing results...
bge_large_en_v1.5 score: 732194 (53.98%)
Saved top 5 retrieved documents to retrieved_results_bge_large_en_v1.5.json


## Hybrid search

Can keyword search help us?
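
One common way to combine a keyword ranking (BM25) with a dense ranking is weighted reciprocal rank fusion (RRF): each retriever contributes `weight / (c + rank)` for every document it returns, and documents are re-sorted by the summed score. To my understanding, LangChain's `EnsembleRetriever` uses a fusion of this kind, but the sketch below is a standalone illustration with hypothetical names, toy rankings, and an assumed constant `c=60`, not that library's code.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs with weighted RRF."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Toy rankings (made-up): BM25 prefers "Romania", the dense model "Bucharest".
bm25_ranking = ["Romania", "Bucharest", "Europe"]
dense_ranking = ["Bucharest", "Romania", "Danube"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.25, 0.75])
print(fused)
```

With the dense retriever weighted at 0.75, its top choice wins the fused ranking here; shifting weight toward BM25 would favor exact keyword matches instead.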

In [13]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(all_splits)
bm25_retriever.k = 10
faiss_retriever = db.as_retriever(search_kwargs={"k": 10}) 

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.25,0.75]
)

In [None]:
results = []
total_score = 0

for idx, item in enumerate(tqdm(train_dataset)):  # for each training query
    question = item['question']
    points = item['points']

    retrieved_docs = ensemble_retriever.invoke(question)
    retrieved_data = [
        {
            "title": doc.metadata["title"],
            "content": doc.page_content,
        }
        for doc in retrieved_docs
    ]

    retr_doc = retrieved_docs[0]
    retr_title = retr_doc.metadata['title']
    gold_title = item['article']

    if retr_title == gold_title:
        total_score += item['points']
        
    results.append({
        "query": question,
        "gold_article": gold_title,
        "points": points, 
        "retrieved_docs": retrieved_data
    })

output_data = {
    "model": model,
    "total_score": total_score,
    "accuracy_percentage": round(total_score / max_score * 100, 2),
    "results": results
}

with open("retrieved_results_hybrid.json", "w", encoding="utf-8") as f:
    json.dump(output_data, f, ensure_ascii=False, indent=4)
logger.info(f"Hybrid score: {total_score} ({output_data['accuracy_percentage']}%)")


## Reranker

In [12]:
def rerank_topk(reranker, question, documents):
    all_docs_ls = []
    titles = []  

    for document in documents:
        doc_content = document.page_content
        doc_title = document.metadata["title"]
        qs_doc_ls = [question, doc_content]
        all_docs_ls.append(qs_doc_ls)
        titles.append(doc_title)

    scores = reranker.compute_score(all_docs_ls)
    zipped_lists = list(zip(scores, all_docs_ls, titles))  # Include titles
    sorted_lists = sorted(zipped_lists, key=lambda x: x[0], reverse=True)
    sorted_scores, sorted_original, sorted_titles = zip(*sorted_lists)
    result_new = [Document(page_content=content[1], metadata={"title": title}) 
                  for content, title in zip(sorted_original, sorted_titles)]
    return result_new, list(sorted_scores)

In [14]:
retriever = db.as_retriever(search_kwargs={"k": 15}) 
reranker = FlagReranker('BAAI/bge-reranker-large')
results = []
total_score = 0

for idx, item in enumerate(tqdm(train_dataset[:20])):
    question = item['question']
    points = item['points']

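    retrieved_docs = retriever.invoke(question)
    # rerank_topk returns (docs, scores); slice each list, not the tuple, to keep the top 5
    reranked_docs, reranked_scores = rerank_topk(reranker, question, retrieved_docs)
    retrieved_docs, reranked_scores = reranked_docs[:5], reranked_scores[:5]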
    
    retrieved_data = [
        {
            "title": doc.metadata["title"],
            "content": doc.page_content,
            "score": score
        }
        for doc, score in zip(retrieved_docs, reranked_scores)
    ]

    retr_doc = retrieved_docs[0]
    retr_title = retr_doc.metadata['title']
    gold_title = item['article']

    if retr_title == gold_title:
        total_score += item['points']
        
    results.append({
        "query": question,
        "gold_article": gold_title,
        "points": points, 
        "retrieved_docs": retrieved_data
    })

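# NOTE: only the first 20 queries are scored above, while max_score sums points over
# the full training set, so this percentage is not comparable to the earlier runs.
output_data = {
    "model": model,
    "total_score": total_score,
    "accuracy_percentage": round(total_score / max_score * 100, 2),
    "results": results
}

with open("retrieved_results_rerank.json", "w", encoding="utf-8") as f:
    json.dump(output_data, f, ensure_ascii=False, indent=4)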

print(f"Rerank score: {total_score} ({output_data['accuracy_percentage']}%)")

  0%|          | 0/20 [00:00<?, ?it/s]

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Rerank score: 686 (0.05%)


In [11]:
import faiss

# Create a CPU-based FAISS index
cpu_index = faiss.IndexFlatL2(128)

# Move FAISS index to GPU
gpu_resources = faiss.StandardGpuResources()  # Initialize GPU resources
gpu_index = faiss.index_cpu_to_gpu(gpu_resources, 0, cpu_index)  # 0 = GPU ID

print(type(gpu_index))  # a GPU index class such as GpuIndexFlat; the module path depends on the faiss build


<class 'faiss.swigfaiss_avx2.GpuIndexFlat'>


In [13]:
import faiss

# Get FAISS index from LangChain's FAISS wrapper
faiss_index = db.index

# Move the FAISS index to GPU
gpu_resources = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(gpu_resources, 0, faiss_index)  # 0 = GPU ID

# Replace LangChain FAISS index with GPU index
db.index = gpu_index

print("FAISS is now using GPU ✅")


FAISS is now using GPU ✅
