# Elastic vs Dense Vector Search

---

Here we load a BM25 retriever and a simple embedding retriever, and measure the MRR (mean reciprocal rank) of each on a question set generated from an Excel procedure manual. We then demonstrate a fusion retriever that combines the two retrievers into one.

This notebook shows that we are not confined to a single embedding model, or to a single type of model. There are many situations where keyword search might be more appropriate.

We also introduce reranking, which in LlamaIndex is currently implemented with [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf). This lets us group together a number of retrievers, so we can include BM25. Reranking is not recommended unless the setup is implemented carefully; that said, reranking can potentially provide an ensemble solution that could be SOTA.


## $\color{blue}{Sections:}$
* Admin
* Data
* Models
* Dataset
* Finetune
* Evaluation

---
## $\color{blue}{Admin}$
---

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
%%capture
!pip install openai llama_index pypdf -q -U

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [None]:
from google.colab import drive

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'

---
## $\color{blue}{Data}$
---

The train data is a Microsoft Excel PDF guide to the new version, Excel 2010, approximately 80 pages long.

The validation data is a university-issued how-to guide for MS Excel.

Get train and validation nodes.

In [None]:
from llama_index.core import SimpleDirectoryReader

In [None]:
train_reader = SimpleDirectoryReader(
    input_files =["RAG_tutorial/Data/excel_train.pdf"]
)

In [None]:
valid_reader = SimpleDirectoryReader(
    input_files =["RAG_tutorial/Data/excel_valid.pdf"]
)

In [None]:
train_data = train_reader.load_data()

In [None]:
valid_data = valid_reader.load_data()

In [None]:
from llama_index.core.node_parser import SimpleNodeParser

In [None]:
parser = SimpleNodeParser(chunk_size=500, chunk_overlap=20)
train_nodes = parser.get_nodes_from_documents(train_data, show_progress=True)
valid_nodes = parser.get_nodes_from_documents(valid_data, show_progress=True)

Parsing nodes:   0%|          | 0/76 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/60 [00:00<?, ?it/s]

In [None]:
print('Train: ', len(train_nodes))
print('Valid: ', len(valid_nodes))

Train:  83
Valid:  73


---
## $\color{blue}{Models}$
---

We relied on Zephyr 7B to generate the test questions from the corpus. The questions have already been generated and saved to file, so they can be loaded directly; otherwise, loading the module that connects to the Hugging Face API takes a long time. As such, we do not load any LLM for these tests.



---
### $\color{red}{Embedder}$
---

To use a different embedding model, follow these steps.

The embedding model can be left at the default, OpenAI ada-002.

In [None]:
from llama_index.core import Settings

In [None]:
Settings.embed_model

Or we can use another model, such as a previously finetuned model.

In [None]:
from huggingface_hub import login
import os

In [None]:
HF_TOKEN = getpass.getpass('Hugging Face token please:')

In [None]:
login(token=HF_TOKEN)
os.environ['HUGGINGFACEHUB_API_TOKEN'] = HF_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
%%capture
%pip install llama-index-embeddings-huggingface

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [None]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
Settings.embed_model = embed_model

In [None]:
Settings.embed_model

HuggingFaceEmbedding(model_name='BAAI/bge-m3', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x79ad23329ab0>, num_workers=None, max_length=8192, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

---
## $\color{blue}{Dataset}$
---

The evaluation task consists of measuring the embedding model's ability to find the relevant section of a document for a given question.

1. The document is split into chunks.
2. Questions are generated from the chunks.
3. The embedding model tries to identify the chunk in the document that was used to create the question.

We now use an LLM to create the questions.

In [None]:
from llama_index.legacy.finetuning import generate_qa_embedding_pairs, EmbeddingQAFinetuneDataset

[nltk_data] Downloading package stopwords to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/legacy/_static/nltk_cache...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/legacy/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Here we skip creating the dataset because it was previously saved to file.

In [None]:
#train_dataset = generate_qa_embedding_pairs(train_nodes, llm=llm)
#train_dataset.save_json("RAG_tutorial/Data/train_dataset.json")

In [None]:
#valid_dataset = generate_qa_embedding_pairs(valid_nodes, llm=llm)
#valid_dataset.save_json("RAG_tutorial/Data/valid_dataset.json")

**Structure**

The resulting structure allows a lookup from each question to the document chunk it was generated from.

queries = {hash: question}

corpus = {hash: corpus}

relevant_docs = {hash_question : [hash_corpus]}

So given a question from queries, we can extract the document chunk that it was made from.
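The lookup can be sketched with plain dictionaries. This is a minimal illustration only: the ids (`q-1`, `c-1`, ...) and the chunk texts are made-up placeholders, not values from the saved dataset, and `source_chunk` is a hypothetical helper name.

```python
# Toy stand-ins for the three mappings; ids and texts are placeholders.
queries = {
    "q-1": "What is Protected View?",
    "q-2": "How has PivotTable View been improved?",
}
corpus = {
    "c-1": "Placeholder text of the chunk about Protected View.",
    "c-2": "Placeholder text of the chunk about PivotTable View.",
}
relevant_docs = {
    "q-1": ["c-1"],
    "q-2": ["c-2"],
}


def source_chunk(query_id):
    """Return the text of the chunk a question was generated from."""
    chunk_id = relevant_docs[query_id][0]  # each question maps to a list of chunk ids
    return corpus[chunk_id]


for query_id, question in queries.items():
    print(question, "->", source_chunk(query_id))
```

The same two-step lookup (question id to chunk id, chunk id to text) is what the evaluation relies on to decide whether a retriever found the right chunk.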

In [None]:
train_dataset = EmbeddingQAFinetuneDataset.from_json("RAG_tutorial/Data/train_dataset.json")

In [None]:
valid_dataset = EmbeddingQAFinetuneDataset.from_json("RAG_tutorial/Data/valid_dataset.json")

In [None]:
dict(train_dataset).keys()

dict_keys(['queries', 'corpus', 'relevant_docs', 'mode'])

In [None]:
for key in list(train_dataset.queries.keys())[3:8]:
  print(train_dataset.queries[key], '\n')

How has the functionality of PivotTable View been improved in Excel 2010? 

What are the new features introduced in Excel Services and how do they improve the functionality of Excel? 

What is Protected View and how does it enhance the security of Excel documents? 

What are the new and enhanced features of Microsoft Excel 2010 that make it possible to analyze, manage, and share information in more ways than ever before? 

How can you easily share your insights with others through Microsoft SharePoint 2010 or your Windows Live account using Excel 2010? 



In [None]:
train_dataset.queries

In [None]:
train_dataset.corpus

In [None]:
train_dataset.corpus.keys()

dict_keys(['9488e67d-84b2-4693-b8b2-9cf9923d0969', '80c89b9a-744e-4419-999c-498430c84d98', '861b2a3d-ba31-4d68-9421-13c3e557d197', '9c95d7ad-99b6-4b14-b193-a8ff0c2f09a0', '918f0ff0-652c-4eb2-a5e2-474c0373504d', '3972807d-5e47-4452-a31d-327b967053f0', '7c35919c-db98-4899-a17d-1b1b990f3e6f', '9650b3e1-b5be-42af-a41c-bb3d0d89b6f0', '72144f8d-c352-47d7-86fb-5456a172a245', '9e1c74a9-5625-4894-8ea4-b433c85f92f4', '3c91f6f9-2e4b-415d-b901-aa69e4600ac0', 'b3078ef6-d64b-4b19-a3c7-0a0a0c4b50c9', '16dd007c-b615-47bb-a79a-3ef101345a46', 'efcf38f2-38d2-4bc4-b315-bc5666ad9ca3', '2823abeb-ccc6-405f-8ba7-7754a26da520', '6c1e35a8-4eb1-402e-bdca-86350f043a6a', 'c272807c-0f5a-4633-b689-b89e52a1681a', '513ac29e-8782-4047-814b-f682cb9c6605', 'dc52fbbe-b091-4a21-b0a2-413b149094b1', 'e98fb931-32c3-4d9f-b42b-c230eaa4372e', '89f1c70c-3bab-4038-a339-f3163c78f14b', '48197ea4-e62c-4950-ba46-997cca939378', '1135f236-a5ec-4a50-9414-5f35baccb81e', 'bf5ed9f5-df3c-4142-9087-6b0becc0350c', 'c6b19891-84a6-40d1-b65f-255b

In [None]:
for key in list(train_dataset.corpus.keys())[3:8]:
  print(' '.join(train_dataset.corpus[key].split()), '\n')

1 Microsoft Excel 2010: An Overview Microsoft® Excel® 2010 delivers rich, new and enhanced features to the world’s most popular productivity suite . Excel 2010 makes it possible to analyze, manage, and share information in more ways than ever before, helping you make better, smarter decisions . With new data analysis and visualization tools, along with managed self -service business intelligence technologies , you can create effective business or information insights that track and highlight important data trends and communicate your results thro ugh h igh-quality charts and graphs. You can also easily share your insights with others through Microsoft SharePoint ® 2010 or your Windows Live ™ account. Work better together by working simultaneously with others online and accomplish your most important t asks faster. Your information is never far away as you can access your files from almost anywhere —from your P C, a Web browser , or smartphone.1 With Excel 2010 you can work when and whe

In [None]:
len(train_dataset.corpus)

In [None]:
# for a query hash, this is the doc hash that relates
for key in list(train_dataset.queries.keys())[3:8]:
  print(train_dataset.relevant_docs[key], '\n')

['80c89b9a-744e-4419-999c-498430c84d98'] 

['861b2a3d-ba31-4d68-9421-13c3e557d197'] 

['861b2a3d-ba31-4d68-9421-13c3e557d197'] 

['9c95d7ad-99b6-4b14-b193-a8ff0c2f09a0'] 

['9c95d7ad-99b6-4b14-b193-a8ff0c2f09a0'] 



---
## $\color{blue}{Evaluation}$
---
We evaluate using MRR (mean reciprocal rank).

$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$

Where $Q$ is the set of queries and $rank_i$ is the position of the ground truth in the list returned for query $i$. For each query we take the reciprocal of that rank (e.g. if the ground truth is 3rd in the list, $RR = \frac{1}{3}$), then take the average over all queries.
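As a small worked example of the formula, the sketch below computes MRR over three toy queries. The document ids and rankings are invented for illustration, and treating a query whose ground truth is missing from the list as contributing 0 is an assumption of this sketch.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant id in the list, or 0.0 if it is absent."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, relevant):
    """Average the reciprocal rank over every query."""
    scores = [reciprocal_rank(rankings[q], relevant[q]) for q in relevant]
    return sum(scores) / len(scores)


# Hypothetical retrieval results for three queries.
rankings = {
    "q1": ["d3", "d1", "d2"],  # ground truth d1 is 2nd -> 1/2
    "q2": ["d2", "d5", "d4"],  # ground truth d4 is 3rd -> 1/3
    "q3": ["d9", "d8", "d7"],  # ground truth d6 is missing -> 0
}
relevant = {"q1": "d1", "q2": "d4", "q3": "d6"}

mrr = mean_reciprocal_rank(rankings, relevant)
print("MRR:", mrr)  # (1/2 + 1/3 + 0) / 3 = 5/18
```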


In [None]:
Settings.embed_model

HuggingFaceEmbedding(model_name='BAAI/bge-m3', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x79ad23329ab0>, num_workers=None, max_length=8192, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

We need to create an index from the same nodes used to create the question dataset, so that we can search over it.

In [None]:
from llama_index.core import VectorStoreIndex

In [None]:
index = VectorStoreIndex(train_nodes)

Now let's define two retriever objects: BM25 for keyword-style search, and a vector retriever for dense vector retrieval.

In [None]:
%%capture
%pip install llama-index-retrievers-bm25

In [None]:
from llama_index.retrievers.bm25 import BM25Retriever

In [None]:
vector_retriever = index.as_retriever(similarity_top_k=5)

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=5
)

In [None]:
vector_retriever._embed_model

This evaluation function calculates the MRR

In [None]:
import numpy as np

In [None]:
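def calc_mrr(retriever, dataset):
  """Mean reciprocal rank of `retriever` over every query in `dataset`."""
  total = []
  # loop over every query
  for query_hash, query in dataset.queries.items():
    # hash of the corpus chunk the question was generated from
    # (use the `dataset` argument rather than the global train_dataset)
    corpus_hash = dataset.relevant_docs[query_hash][0]
    true_text = dataset.corpus[corpus_hash]  # text of that chunk

    results = retriever.retrieve(query)

    # reciprocal rank stays 0 when the chunk is not in the returned results
    reciprocal_rank = 0
    for i, result in enumerate(results):
      if result.text == true_text:
        reciprocal_rank = 1 / (i + 1)
        break  # stop at the first (highest-ranked) match

    total.append(reciprocal_rank)

  return np.mean(total)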


In [None]:
mrr = calc_mrr(bm25_retriever, train_dataset)
print('BM25')
print('MRR: ', mrr)

BM25
MRR:  0.7763063063063064


In [None]:
mrr = calc_mrr(vector_retriever, train_dataset)
print('Dense')
print('MRR: ', mrr)

Dense
MRR:  0.6281981981981982


We can combine the two and apply a rerank algorithm to the results. This method scores each document by summing 1/(60 + rank) over the models used. The problem is that bad models are too influential. I would suggest not using fusion reranking unless there is a large pool of embedding models, or a specific weighting is applied to weaker models.

It would seem safer to test numerous models and take the most powerful. Averaging isn't ideal with a sample size of 2: here a very weak model might rank a poor result first, and that counts just as much as the stronger model's top result.
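The 1/(60 + rank) scoring and the sample-size concern can be sketched in a few lines. This is a plain-Python illustration of Reciprocal Rank Fusion, not the LlamaIndex implementation; the function name, the two toy rankers, and the document ids are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; highest fused score first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # each list adds 1/(k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical case: "A" is the correct chunk.
strong_ranker = ["A", "B", "C"]  # places the correct chunk first
weak_ranker = ["C", "B", "A"]    # places the correct chunk last

fused = dict(reciprocal_rank_fusion([strong_ranker, weak_ranker]))
for doc_id, score in fused.items():
    print(doc_id, round(score, 6))
```

With only two lists, the weak ranker's first place for C exactly offsets the strong ranker's first place for A, so A and C end up with the same fused score. Nothing in the unweighted sum tells the fusion which retriever to trust more.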

A nice idea would be to have a decent-sized pool of models and then use gradient boosting to learn when each model can be relied on. I'm not sure how compatible this would be with LlamaIndex: we would have to create a class that inherits from Query Fusion Retriever, which would involve a lot of work, but it is probably not beyond the realms of possibility.

Gradient-boosted reranking is an idea that could potentially lead to SOTA, compared with methods that currently rely on simple or weighted averaging.

In [None]:
from llama_index.core.retrievers import QueryFusionRetriever

In [None]:
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
    # query_gen_prompt="...",  # we could override the query generation prompt here
)

In [None]:
mrr = calc_mrr(retriever, train_dataset)
print('Both')
print('MRR: ', mrr)

Both
MRR:  0.746936936936937
