# [LlamaIndex] Hybrid Search with QueryFusionRetriever
- Writer : Woocheol Cho https://www.linkedin.com/in/woocheolcho/
- Date : 2024.07.22

## 0. LlamaIndex Installation & OpenAI API Key Registration

In [4]:
!pip install openai



In [1]:
!pip install llama-index  # LlamaIndex installation
!pip install llama-index-retrievers-bm25==0.1.4  # LlamaIndex BM25 retriever installation


In [2]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

In [6]:
import os
import openai

# os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# openai.api_key = os.environ["OPENAI_API_KEY"]
openai.api_key = OPENAI_API_KEY

## 1. Comparison of Vector (Dense) Retriever and BM25 (Sparse) Retriever
- Reference: https://docs.llamaindex.ai/en/stable/examples/vector_stores/postgres/?h=use_async

In [7]:
from llama_index.core import Document
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.response.notebook_utils import display_source_node


documents = [
    Document(text="Add a customer"),
    Document(text="Add location"),
    Document(text="Add registration"),
    Document(text="Registration is bull sh*t"),
    Document(text="Location of today's Sunnyvale California is awesome!")
]


index = VectorStoreIndex.from_documents(documents)

In [8]:
vector_retriever = index.as_retriever(similarity_top_k=3)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=3)

### 1-1. Vector retriever

In [10]:
nodes_with_scores = vector_retriever.retrieve("I want to open a new store.")

for node in nodes_with_scores:
    display_source_node(node)

**Node ID:** 54c91689-4da9-4dce-a0b6-5ed4699b51ae<br>**Similarity:** 0.7903476880223865<br>**Text:** Add a customer<br>

**Node ID:** 12a88c65-0887-404b-a10e-0c2ca6c2d341<br>**Similarity:** 0.7855751579853391<br>**Text:** Add location<br>

**Node ID:** f37a07a7-c100-4cac-8c77-27b8c497c02f<br>**Similarity:** 0.7536020551414576<br>**Text:** Add registration<br>

### 1-2. BM25 retriever

In [12]:
nodes_with_scores = bm25_retriever.retrieve("I want to open a new store.")

for node in nodes_with_scores:
    display_source_node(node)

**Node ID:** ba0b545f-6e37-4db6-a46d-75870f5eb02a<br>**Similarity:** 0.0<br>**Text:** Location of today's Sunnyvale California is awesome!<br>

**Node ID:** 36994c40-d29f-477d-bc15-a2cad75ad361<br>**Similarity:** 0.0<br>**Text:** Registration is bull sh*t<br>

**Node ID:** f37a07a7-c100-4cac-8c77-27b8c497c02f<br>**Similarity:** 0.0<br>**Text:** Add registration<br>

## 2. Comparison of Hybrid Retrievers

### 2-1. Simple fusion hybrid retriever
- Simple Fusion keeps the original similarity score computed by each Retriever, with no normalization or reweighting. When the same node is returned by more than one Retriever, it keeps the highest of that node's scores. The merged results are then sorted in descending order of score.

In [13]:
simple_hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=3,
    num_queries=1,  # set this to 1 to disable query generation
    mode="simple",
    use_async=False,
    verbose=True,
)

nodes_with_scores = simple_hybrid_retriever.retrieve("I want to open a new store.")

for node in nodes_with_scores:
    display_source_node(node)




**Node ID:** 54c91689-4da9-4dce-a0b6-5ed4699b51ae<br>**Similarity:** 0.7903476880223865<br>**Text:** Add a customer<br>

**Node ID:** 12a88c65-0887-404b-a10e-0c2ca6c2d341<br>**Similarity:** 0.7855751579853391<br>**Text:** Add location<br>

**Node ID:** f37a07a7-c100-4cac-8c77-27b8c497c02f<br>**Similarity:** 0.7536020551414576<br>**Text:** Add registration<br>
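
The result above is identical to the vector retriever's output because every BM25 score was 0.0, so the maximum for each node is always the vector score. A minimal pure-Python sketch of the merge rule, assuming results are plain `(node_id, score)` pairs; `simple_fusion` is an illustrative helper written for this note, not the LlamaIndex implementation:

```python
def simple_fusion(result_lists, top_k):
    """Merge (node_id, score) lists, keeping the highest score per node."""
    best = {}
    for results in result_lists:
        for node_id, score in results:
            # For a duplicate node, keep the larger of the two raw scores.
            best[node_id] = max(score, best.get(node_id, score))
    # Sort by score, highest first, and keep the top_k entries.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Scores copied from the outputs in sections 1-1 and 1-2 (node IDs shortened).
vector_results = [
    ("54c91689", 0.7903476880223865),
    ("12a88c65", 0.7855751579853391),
    ("f37a07a7", 0.7536020551414576),
]
bm25_results = [("ba0b545f", 0.0), ("36994c40", 0.0), ("f37a07a7", 0.0)]

fused = simple_fusion([vector_results, bm25_results], top_k=3)
for node_id, score in fused:
    print(node_id, score)
```

Note that the weights `[0.6, 0.4]` play no role in this sketch, which is consistent with the unchanged raw scores printed above.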

### 2-2. Relative score fusion hybrid retriever
- Relative Score Fusion min-max normalizes the results from each Retriever, so that within each result list the highest score becomes 1 and the lowest becomes 0. It then multiplies each normalized score by its Retriever's weight and divides by the number of queries. For duplicate nodes, it sums the adjusted scores. The final results are sorted in descending order of these aggregated scores.

In [14]:
relative_score_hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=3,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=False,
    verbose=True,
)

nodes_with_scores = relative_score_hybrid_retriever.retrieve("I want to open a new store.")

for node in nodes_with_scores:
    display_source_node(node)



**Node ID:** 54c91689-4da9-4dce-a0b6-5ed4699b51ae<br>**Similarity:** 0.6<br>**Text:** Add a customer<br>

**Node ID:** 12a88c65-0887-404b-a10e-0c2ca6c2d341<br>**Similarity:** 0.5220718818068131<br>**Text:** Add location<br>

**Node ID:** f37a07a7-c100-4cac-8c77-27b8c497c02f<br>**Similarity:** 0.0<br>**Text:** Add registration<br>
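
The scores above can be reproduced by hand from the raw scores in sections 1-1 and 1-2. The following is a minimal sketch under stated assumptions: results are plain `(node_id, score)` pairs, `relative_score_fusion` is an illustrative helper rather than the library's code, and a list whose scores are all identical is assumed to map to 1.0 if the score is positive and 0.0 otherwise (which matches the all-zero BM25 list contributing nothing here).

```python
def relative_score_fusion(result_lists, weights, num_queries=1, top_k=3):
    """Min-max scale each list, weight it, and sum scores of duplicate nodes."""
    fused = {}
    for results, weight in zip(result_lists, weights):
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        for node_id, score in results:
            if high == low:
                # Assumption: identical scores map to 1.0 if positive, else 0.0.
                scaled = 1.0 if high > 0 else 0.0
            else:
                scaled = (score - low) / (high - low)
            # Apply the retriever weight, then divide by the number of queries.
            fused[node_id] = fused.get(node_id, 0.0) + scaled * weight / num_queries
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Scores copied from the outputs in sections 1-1 and 1-2 (node IDs shortened).
vector_results = [
    ("54c91689", 0.7903476880223865),
    ("12a88c65", 0.7855751579853391),
    ("f37a07a7", 0.7536020551414576),
]
bm25_results = [("ba0b545f", 0.0), ("36994c40", 0.0), ("f37a07a7", 0.0)]

fused = relative_score_fusion([vector_results, bm25_results], weights=[0.6, 0.4])
for node_id, score in fused:
    print(node_id, score)
```

The top vector result scales to 1 and is multiplied by 0.6, the lowest vector result scales to 0, and the middle one lands at roughly 0.87 × 0.6 ≈ 0.522.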

### 2-3. Relative score fusion (dist-based) hybrid retriever
- With `mode="dist_based_score"`, Relative Score Fusion normalizes each Retriever's results using the mean and standard deviation of its scores rather than their observed minimum and maximum: the lower and upper bounds are set to the mean minus and plus three standard deviations, and each score is scaled relative to those bounds. It then applies Retriever weights and divides by the number of queries to balance the scores. For duplicate nodes, it sums the adjusted scores. The final results are sorted in descending order of these aggregated scores.
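
The normalization step described above can be sketched as follows. This is an illustration only: `dist_based_scale` is a hypothetical helper, the example scores are made up, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def dist_based_scale(scores):
    """Scale scores using bounds of mean +/- 3 standard deviations."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    low, high = mu - 3 * sigma, mu + 3 * sigma
    if high == low:
        # All scores identical: no spread to scale against (assumed fallback).
        return [1.0 if high > 0 else 0.0 for _ in scores]
    return [(score - low) / (high - low) for score in scores]


example_scores = [0.9, 0.8, 0.7]  # made-up similarity scores
scaled = dist_based_scale(example_scores)
for raw, value in zip(example_scores, scaled):
    print(raw, round(value, 4))
```

With these bounds a score equal to the mean maps to 0.5, and the best and worst results in a list do not get pinned to exactly 1 and 0 as they do with min-max scaling.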

In [None]:
dist_based_score_hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=3,
    num_queries=1,  # set this to 1 to disable query generation
    mode="dist_based_score",
    use_async=False,
    verbose=True,
)

nodes_with_scores = dist_based_score_hybrid_retriever.retrieve("It is wet outside")

for node in nodes_with_scores:
    display_source_node(node)




**Node ID:** f0c35cee-3f74-444c-b97e-c7bd3f0f9300<br>**Similarity:** 0.4815599137259842<br>**Text:** The streets are getting wet<br>

**Node ID:** 7ac23360-6034-4876-ae30-deeeeaf7791c<br>**Similarity:** 0.40584631457412396<br>**Text:** The weather outside is clear<br>

**Node ID:** 9d6644fe-5896-4c4a-a3d3-0f4666b3a95b<br>**Similarity:** 0.365453319620788<br>**Text:** It's a rainy day<br>

### 2-4. Reciprocal rerank hybrid retriever
- Reciprocal Rank Fusion scores the results from each Retriever by rank alone, ignoring the raw similarity values. Each result contributes 1/(rank + k), where rank is the zero-based position in that Retriever's result list and k is set to 60. For duplicate nodes, it sums the contributions from each Retriever. The final results are sorted in descending order of these aggregated scores. This method favors items that consistently rank high across multiple Retrievers.

In [None]:
reciprocal_rerank_hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=3,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=False,
    verbose=True,
)

nodes_with_scores = reciprocal_rerank_hybrid_retriever.retrieve("It is wet outside")

for node in nodes_with_scores:
    display_source_node(node)



**Node ID:** f0c35cee-3f74-444c-b97e-c7bd3f0f9300<br>**Similarity:** 0.03279569892473118<br>**Text:** The streets are getting wet<br>

**Node ID:** 7ac23360-6034-4876-ae30-deeeeaf7791c<br>**Similarity:** 0.03279569892473118<br>**Text:** The weather outside is clear<br>

**Node ID:** 9d6644fe-5896-4c4a-a3d3-0f4666b3a95b<br>**Similarity:** 0.01639344262295082<br>**Text:** It's a rainy day<br>
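
The printed scores are consistent with the formula: 0.0328 ≈ 1/60 + 1/62 (first in one list, third in the other) and 0.0164 ≈ 1/61 (second in a single list). A minimal sketch, assuming each Retriever's output is reduced to an ordered list of node labels; the helper `reciprocal_rank_fusion` and the two rankings are hypothetical, chosen only so that the scores match the output above:

```python
def reciprocal_rank_fusion(ranked_lists, k=60.0, top_k=3):
    """Sum 1 / (rank + k) over every list a node appears in (rank starts at 0)."""
    fused = {}
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking):
            fused[node_id] = fused.get(node_id, 0.0) + 1.0 / (rank + k)
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical rankings; which retriever produced which order is not known.
ranking_a = ["streets", "rainy", "clear"]
ranking_b = ["clear", "other", "streets"]

fused = reciprocal_rank_fusion([ranking_a, ranking_b])
for node_id, score in fused:
    print(node_id, score)
```

Because only ranks are used, a node that appears near the top of both lists outscores a node that tops one list but is missing from the other.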