# BM25 Retriever
In this guide, we define a BM25 retriever that searches documents using the BM25 ranking method.

This notebook is very similar to the RouterQueryEngine notebook.

## Setup

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [1]:
# %pip install llama-index-llms-openai
# %pip install llama-index-retrievers-bm25

In [2]:
# !pip install llama-index

In [3]:
# NOTE: This is ONLY necessary in a Jupyter notebook.
# Details: Jupyter runs an event loop behind the scenes.
#          Starting another event loop to make async queries would nest the two.
#          This is normally not allowed, so we use nest_asyncio to allow it for convenience.
import nest_asyncio

nest_asyncio.apply()
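
The restriction that `nest_asyncio` works around can be reproduced with the standard library alone. The sketch below is illustrative only (the function names are made up for this example) and is meant to be run in a plain Python process, without `nest_asyncio.apply()` in effect: calling `asyncio.run` while a loop is already running raises a `RuntimeError`.

```python
import asyncio


async def fetch_answer():
    # Stand-in for an async query
    return 42


async def notebook_cell():
    # Simulates code that runs inside an already-active event loop (as in Jupyter)
    coro = fetch_answer()
    try:
        asyncio.run(coro)  # starting a second loop from a running loop is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


nested_result = asyncio.run(notebook_cell())
print(nested_result)
```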

In [4]:
import os
import openai

# os.environ["OPENAI_API_KEY"] = "sk-..."
# openai.api_key = os.environ["OPENAI_API_KEY"]

In [5]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

NumExpr defaulting to 10 threads.


## Download Data

In [6]:
!mkdir -p data
!curl https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt > data/paul_graham_essay.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   111k      0 --:--:-- --:--:-- --:--:--  111k


## Load Data

We first show how to convert a Document into a set of Nodes and insert them into a DocumentStore.
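
As a rough mental model of this step, the sketch below packs sentences greedily into chunks and stores them under generated IDs. It is a simplified stand-in, not how `SentenceSplitter` is implemented: the word-based limit, the `split_into_chunks` helper, and the dictionary used as a document store are all assumptions made for illustration.

```python
import re


def split_into_chunks(text, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        # Start a new chunk when adding this sentence would exceed the limit
        if current and words_so_far + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "I wrote stories. I also programmed. Then I went to college. After that I painted."
chunks = split_into_chunks(text, max_words=7)

# Toy "document store": a mapping from node ID to chunk text
toy_docstore = {f"node-{i}": chunk for i, chunk in enumerate(chunks)}
print(toy_docstore)
```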

In [11]:
# load documents
documents = SimpleDirectoryReader(input_files=["data/paul_graham_essay.txt"]).load_data()

In [12]:
# initialize LLM + node parser
llm = OpenAI(model="gpt-4o")
splitter = SentenceSplitter(chunk_size=1024)

nodes = splitter.get_nodes_from_documents(documents)

In [13]:
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

In [14]:
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


## BM25 Retriever

We will search the documents with the BM25 retriever.
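
For intuition about the scores the retriever reports, here is a minimal pure-Python sketch of a common BM25 variant. It is an assumption-laden illustration, not the code behind `BM25Retriever`: the whitespace tokenizer, the smoothed IDF form, and the defaults `k1=1.5` and `b=0.75` are choices made for this example, and the library's backend may tokenize and weight terms differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook-style BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; the "1 +" keeps the value non-negative
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "viaweb was an online store builder",
    "interleaf made software for creating documents",
    "painting students were supposed to express themselves",
]
toy_scores = bm25_scores("viaweb store", toy_docs)
best = max(range(len(toy_docs)), key=toy_scores.__getitem__)
print(best, toy_scores)
```

Because BM25 matches on exact terms, documents that share no token with the query score zero here, which is why it complements embedding-based retrieval.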

In [15]:
# We can pass in the index, docstore, or list of nodes to create the retriever
retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=2)

In [16]:
from llama_index.core.response.notebook_utils import display_source_node

# retrieve context about specific companies
# (use a separate name so the parsed `nodes` list is not overwritten)
retrieved_nodes = retriever.retrieve("What happened at Viaweb and Interleaf?")
for node in retrieved_nodes:
    display_source_node(node)

**Node ID:** 57923292-a72a-49f6-92e7-1a40514cc4e4<br>**Similarity:** 1.284290506425639<br>**Text:** Notes

[1] My experience skipped a step in the evolution of computers: time-sharing machines with...<br>

**Node ID:** a079209a-9b18-4e21-adcf-6fdf97fe5db5<br>**Similarity:** 1.1042845548871<br>**Text:** I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking ...<br>

In [17]:
retrieved_nodes = retriever.retrieve("What did Paul Graham do after RISD?")
for node in retrieved_nodes:
    display_source_node(node)

**Node ID:** 8bf31f1c-1533-45c4-a8a5-2d212fbef734<br>**Similarity:** 5.1818783699170625<br>**Text:** Now they are, though. Now you could continue using McCarthy's axiomatic approach till you'd defin...<br>

**Node ID:** 19225f28-0307-4345-9e7d-3fd21f4a0efe<br>**Similarity:** 1.101230109848415<br>**Text:** Painting students were supposed to express themselves, which to the more worldly ones meant to tr...<br>

## Router Retriever with bm25 method

Now we will combine the BM25 retriever with the vector index retriever behind a router that chooses which retriever(s) to use for each query.
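
The control flow of routing can be sketched without any LLM. In the toy version below, a hypothetical `toy_selector` stands in for the LLM-based selector, retrievers are plain functions returning `(node_id, text)` pairs, and results from the selected retrievers are merged by node ID. None of these names are LlamaIndex APIs; they exist only to show the idea.

```python
def route(query, tools, selector):
    """Run every retriever the selector picks and merge the results by node ID."""
    merged = {}
    descriptions = [tool["description"] for tool in tools]
    for index in selector(query, descriptions):
        for node_id, text in tools[index]["retriever"](query):
            merged.setdefault(node_id, text)  # keep the first occurrence of each node
    return list(merged.items())


def toy_selector(query, descriptions):
    # Hypothetical stand-in for the LLM: a real selector would read the
    # descriptions; this one just adds the "specific" tool for quoted terms.
    picks = [0]
    if '"' in query:
        picks.append(1)
    return picks


def general_retriever(query):
    return [("n1", "overview of the essay"), ("n2", "early life")]


def exact_retriever(query):
    return [("n2", "early life"), ("n3", "Viaweb details")]


tools = [
    {"description": "Useful in most cases", "retriever": general_retriever},
    {"description": "Useful if searching about specific information", "retriever": exact_retriever},
]

routed_general = route("Tell me about the author", tools, toy_selector)
routed_both = route('What is "Viaweb"?', tools, toy_selector)
print(routed_general)
print(routed_both)
```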

In [18]:
from llama_index.core.tools import RetrieverTool

vector_retriever = VectorIndexRetriever(index)
# Build from the docstore, which holds the parsed nodes. BM25Retriever expects
# nodes, not the NodeWithScore results returned by an earlier retrieve() call.
bm25_retriever = BM25Retriever.from_defaults(
    docstore=storage_context.docstore, similarity_top_k=2
)

retriever_tools = [
    RetrieverTool.from_defaults(
        retriever=vector_retriever,
        description="Useful in most cases",
    ),
    RetrieverTool.from_defaults(
        retriever=bm25_retriever,
        description="Useful if searching about specific information",
    ),
]

In [19]:
from llama_index.core.retrievers import RouterRetriever

retriever = RouterRetriever.from_defaults(
    retriever_tools=retriever_tools,
    llm=llm,
    select_multi=True,
)

In [23]:
# ask a broad question about the author's life
retrieved_nodes = retriever.retrieve(
    "Can you give me all the context regarding the author's life?"
)
for node in retrieved_nodes:
    display_source_node(node)

HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Selecting retriever 1: The question asks for specific information about the author's life, making this choice the most relevant..

## Advanced - Hybrid Retriever + Re-Ranking

Here we extend the base retriever class and create a custom retriever that always uses both the vector retriever and the BM25 retriever.

The combined nodes can then be re-ranked and filtered. This lets us keep the intermediate top-k values large and rely on the re-ranker to filter out unneeded nodes.
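
The retrieve-wide-then-filter idea can be sketched in a few lines. The example below is a toy: the `rerank` helper and its word-overlap relevance score are stand-ins for a cross-encoder re-ranker, and the candidate texts are invented for illustration.

```python
def rerank(query, candidates, top_n):
    """Re-score candidates against the query and keep the best top_n node IDs."""
    query_terms = set(query.lower().split())

    def relevance(text):
        # Toy stand-in for a cross-encoder score: share of query terms present
        return len(query_terms & set(text.lower().split())) / len(query_terms)

    scored = [(relevance(text), node_id) for node_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in order
    return [node_id for _, node_id in scored[:top_n]]


# A deliberately broad candidate pool, as produced by large intermediate top-k values
candidates = [
    ("a", "chapter authors and review editors"),
    ("b", "ocean warming and acidification harm marine ecosystems"),
    ("c", "climate change impact on the ocean includes sea level rise"),
    ("d", "this chapter should be cited as follows"),
]
top = rerank("impact of climate change on the ocean", candidates, top_n=2)
print(top)
```

The broad pool improves recall, while the second scoring pass decides which few nodes actually reach the LLM.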

To best demonstrate this, we will use a larger set of source documents -- Chapter 3 from the 2022 IPCC Climate Report.

### Setup data

In [21]:
# !curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output data/IPCC_AR6_WGII_Chapter03.pdf



In [24]:
# !pip install pypdf

In [25]:
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    SimpleDirectoryReader,
    Document,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# load documents
documents = SimpleDirectoryReader(
    input_files=["data/IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

In [26]:
# initialize llm + node parser
# -- here, we set a smaller chunk size, to allow for more effective re-ranking
llm = OpenAI(model="gpt-4o")
splitter = SentenceSplitter(chunk_size=256)
# limit to a smaller section: only the first loaded Document, truncated to 1,000,000 characters
nodes = splitter.get_nodes_from_documents(
    [Document(text=documents[0].get_content()[:1000000])]
)

In [27]:
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

In [28]:
index = VectorStoreIndex(nodes, storage_context=storage_context)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [29]:
from llama_index.retrievers.bm25 import BM25Retriever

# retrieve the top 10 most similar nodes using embeddings
vector_retriever = index.as_retriever(similarity_top_k=10)

# retrieve the top 10 most similar nodes using BM25
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

### Custom Retriever Implementation

In [30]:
from llama_index.core.retrievers import BaseRetriever


class HybridRetriever(BaseRetriever):
    """Queries both retrievers and returns the de-duplicated union of their nodes."""

    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        super().__init__()

    def _retrieve(self, query, **kwargs):
        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)
        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)

        # combine the two lists, keeping the first occurrence of each node ID
        # (BM25 results come first)
        all_nodes = []
        node_ids = set()
        for n in bm25_nodes + vector_nodes:
            if n.node.node_id not in node_ids:
                all_nodes.append(n)
                node_ids.add(n.node.node_id)
        return all_nodes

In [31]:
hybrid_retriever = HybridRetriever(vector_retriever, bm25_retriever)

### Re-Ranker Setup

In [32]:
# !pip install sentence-transformers

In [34]:
from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(top_n=4, model="BAAI/bge-reranker-base")

### Retrieve

In [35]:
from llama_index.core import QueryBundle

retrieved_nodes = hybrid_retriever.retrieve(
    "What is the impact of climate change on the ocean?"
)
reranked_nodes = reranker.postprocess_nodes(
    retrieved_nodes,
    query_bundle=QueryBundle(
        "What is the impact of climate change on the ocean?"
    ),
)

print("Initial retrieval: ", len(retrieved_nodes), " nodes")
print("Re-ranked retrieval: ", len(reranked_nodes), " nodes")

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


Batches: 100%|██████████| 1/1 [00:00<00:00,  1.63it/s]


Initial retrieval:  13  nodes
Re-ranked retrieval:  4  nodes


In [36]:
from llama_index.core.response.notebook_utils import display_source_node

for node in reranked_nodes:
    display_source_node(node)

**Node ID:** b921ca47-329e-45ee-95e4-03a16e753aff<br>**Similarity:** 0.0009219407220371068<br>**Text:** Ghebrehiwet, S.-I.  Ito, W.  Kiessling, P .  Martinetto, E.  Ojea, 
M.-F . Racault, B.  Rost, and...<br>

**Node ID:** 1d2ed9ab-cd19-4f74-b862-c2dc2f20143a<br>**Similarity:** 0.0006790422485210001<br>**Text:** SPM379
3
Oceans and Coastal 
Ecosystems and Their Services
This chapter should be cited as:
Coole...<br>

**Node ID:** 92bf3c5a-71bb-4fba-9b92-02aa4e4b476a<br>**Similarity:** 0.0006047315546311438<br>**Text:** In: Climate 
Change 2022: Impacts, Adaptation and Vulnerability. Contribution of Working Group II...<br>

**Node ID:** f5ff880a-dac6-4b8f-95ad-08b40926dc35<br>**Similarity:** 0.0004132408066652715<br>**Text:** Helen 
Gurney-Smith (Canada), Marjolijn Haasnoot (The Netherlands), Rebecca Harris (Australia), S...<br>

### Full Query Engine 

In [37]:
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
    llm=llm,
)

response = query_engine.query(
    "What is the impact of climate change on the ocean?"
)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


Batches: 100%|██████████| 1/1 [00:00<00:00, 54.28it/s]


HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [38]:
from llama_index.core.response.notebook_utils import display_response

display_response(response)

**`Final Response:`** Climate change significantly impacts oceans and coastal ecosystems, affecting their services. These impacts include rising sea temperatures, ocean acidification, and changes in marine biodiversity. These changes can disrupt marine food webs, alter fish populations, and affect the livelihoods of communities dependent on marine resources. Additionally, the increased frequency and intensity of extreme weather events can damage coastal habitats and infrastructure. Adaptation and mitigation strategies are essential to address these challenges and protect ocean and coastal ecosystems.