# BM25 Retriever
In this guide, we define a BM25 retriever that searches documents using the BM25 ranking method.

This notebook is very similar to the RouterQueryEngine notebook.

### Setup

In [1]:
# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event-loop behind the scenes.
#          This results in nested event-loops when we start an event-loop to make async queries.
#          This is normally not allowed, we use nest_asyncio to allow it for convenience.
import nest_asyncio

nest_asyncio.apply()

In [38]:
import logging
import sys

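# Log INFO-level messages to stdout. The handler list is reset before the
# stdout handler is added so that each message is only printed once.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))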

from llama_index import (
    ListIndex,
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
    SimpleKeywordTableIndex,
    VectorStoreIndex,
)
from llama_index.retrievers import BM25Retriever
from llama_index.indices.vector_store.retrievers.retriever import VectorIndexRetriever
from llama_index.llms import OpenAI

### Load Data

We first show how to convert Documents into a set of Nodes and insert them into a DocumentStore.

In [14]:
# load documents
documents = SimpleDirectoryReader("../data/paul_graham").load_data()

In [15]:
# initialize service context (set chunk size)
llm = OpenAI(model="gpt-4")
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm)
nodes = service_context.node_parser.get_nodes_from_documents(documents)
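
To make the role of `chunk_size` more concrete, here is a rough, standalone sketch of fixed-size chunking. It is not the node parser used above: the `chunk_words` helper is hypothetical, it counts whitespace-separated words rather than model tokens, and the `overlap` parameter is an assumption added for illustration.

```python
def chunk_words(text, chunk_size=1024, overlap=0):
    """Split text into chunks of at most `chunk_size` words (hypothetical helper)."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
        start += step
    return chunks


# Toy input: ten "words", chunks of four words, one word shared between neighbours.
sample_text = " ".join(f"w{i}" for i in range(10))
sample_chunks = chunk_words(sample_text, chunk_size=4, overlap=1)
```

Smaller chunks give the retriever more, shorter nodes to rank; larger chunks keep more surrounding context in each node.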

In [16]:
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

In [22]:
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    service_context=service_context,
)

### BM25 Retriever

We will search the documents with the BM25 retriever.

In [25]:
from llama_index.utils import globals_helper

retriever = BM25Retriever(index, globals_helper.tokenizer)
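
For intuition about what a BM25 score measures, the following is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration only, not the code behind `BM25Retriever`: the `bm25_scores` function, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    # Assumption: a naive lowercase/whitespace tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1; b penalizes longer documents.
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "paul graham studied painting at risd",
    "the startup wrote software for online stores",
    "painting and software are both about making things",
]
toy_scores = bm25_scores("painting risd", toy_docs)
```

In this toy corpus the first document matches both query terms and scores highest, the third matches only "painting", and the second matches nothing and scores zero. BM25 scores are unbounded sums rather than values between 0 and 1.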

In [35]:
from llama_index.response.notebook_utils import display_source_node

# will retrieve all context from the author's life
nodes = retriever.retrieve(
    "Can you give me all the context regarding the author's life?"
)
for node in nodes:
    display_source_node(node)

**Node ID:** 352371f3-1031-424f-87f0-3efadc6409c9<br>**Similarity:** 7.322087057741413<br>**Text:** This name didn't last long before it was replaced by "software as a service," but it was current ...<br>

**Node ID:** b6244135-d9c7-43e7-baae-7b8357a33d2f<br>**Similarity:** 7.121406119821665<br>**Text:** There, right on the wall, was something you could make that would last.Paintings didn't become ob...<br>

In [37]:
nodes = retriever.retrieve("What did Paul Graham do after RISD?")
for node in nodes:
    display_source_node(node)

**Node ID:** 6f24fb65-6604-41da-b487-0f5ddc7ba66c<br>**Similarity:** 6.84581665569314<br>**Text:** That seemed unnatural to me, and on this point the rest of the world is coming around to my way o...<br>

**Node ID:** b9e0c4b2-8c64-484a-875c-1cca6a463226<br>**Similarity:** 6.259657209129812<br>**Text:** So I decided to take a shot at it.It took 4 years, from March 26, 2015 to October 12, 2019.It was...<br>

### Hybrid Retriever with BM25

Now we will combine the BM25 retriever with the vector index retriever.

In [57]:
from llama_index.tools import RetrieverTool

vector_retriever = VectorIndexRetriever(index)
bm25_retriever = BM25Retriever(index, globals_helper.tokenizer)

retriever_tools = [
    RetrieverTool.from_defaults(
        retriever=vector_retriever,
        description="Useful in most cases",
    ),
    RetrieverTool.from_defaults(
        retriever=bm25_retriever,
        description="Useful if searching about specific information",
    ),
]

In [58]:
from llama_index.selectors.pydantic_selectors import PydanticMultiSelector
from llama_index.retrievers import RouterRetriever

retriever = RouterRetriever.from_defaults(
    retriever_tools=retriever_tools,
    service_context=service_context,
    select_multi=True,
)
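
With `select_multi=True` the selector may pick more than one retriever, in which case the same node can come back from several of them. As a rough illustration of one way such results could be combined, the sketch below de-duplicates hits by node ID. This is an assumption for illustration, not necessarily what `RouterRetriever` does internally; `merge_by_id` and the `(node_id, score, text)` tuple layout are hypothetical.

```python
def merge_by_id(*result_lists):
    """Merge retriever outputs, keeping the first hit seen for each node ID."""
    merged = {}
    for results in result_lists:
        for node_id, score, text in results:
            # setdefault keeps the earlier entry if the ID was already seen.
            merged.setdefault(node_id, (node_id, score, text))
    return list(merged.values())


# Toy hits: node "b" is returned by both retrievers.
vector_hits = [("a", 0.78, "..."), ("b", 0.75, "...")]
bm25_hits = [("b", 7.1, "..."), ("c", 6.8, "...")]
merged_hits = merge_by_id(vector_hits, bm25_hits)
```

The sketch deliberately does not re-sort by score: BM25 scores and embedding similarities are on different scales, so comparing them directly would not be meaningful.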

In [59]:
# will retrieve all context from the author's life
nodes = retriever.retrieve(
    "Can you give me all the context regarding the author's life?"
)
for node in nodes:
    display_source_node(node)

Selecting retriever 0: This choice is relevant because it suggests that the information provided will be useful in most cases, which could include providing context about the author's life..


**Node ID:** b6244135-d9c7-43e7-baae-7b8357a33d2f<br>**Similarity:** 0.7840662826916402<br>**Text:** There, right on the wall, was something you could make that would last.Paintings didn't become ob...<br>

**Node ID:** f9453662-1e69-4138-aa90-8d472bca83e6<br>**Similarity:** 0.7817517935553296<br>**Text:** The students and faculty in the painting department at the Accademia were the nicest people you c...<br>