# Hybrid Retrieval Pipeline
This notebook combines keyword-based and embedding-based retrieval techniques, leveraging the strengths of both approaches.

Keyword-based retrieval (TF-IDF, BM25) is sparse retrieval, which excels at matching exact keywords.

Embedding-based retrieval uses dense embeddings, which excel at capturing the contextual nuances of the query.
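To make the keyword-based side concrete, here is a minimal sketch of BM25 scoring in plain Python. It is a toy illustration, not the implementation used by Haystack: the function name `bm25_scores`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # a term that is absent contributes nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Length normalization: longer documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, not taken from the PubMed dataset.
toy_docs = [
    "theophylline reduces apnea in preterm infants",
    "sleep apnea in adults is treated with cpap",
    "vitamin d levels in adults",
]
toy_scores = bm25_scores("apnea infants", toy_docs)
```

The first toy document matches both query terms and scores highest, while the third shares no terms with the query and scores zero. That is the typical weakness of sparse retrieval: a document that is relevant but uses different wording gets no credit.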

In [25]:
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

#### Fetching and Processing Documents
We use abstracts from PubMed papers as the document collection.

In [26]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("anakin87/medrag-pubmed-chunk", split="train")

docs = []
for doc in dataset:
    docs.append(
        Document(content=doc["contents"], meta={
            "title": doc["title"],
            "abstract": doc["content"],
            "pmid": doc["id"]
        })
    )

#### Indexing Documents with Pipeline
We then create a pipeline that stores the documents in the document store together with their embeddings.

The pipeline uses a DocumentSplitter to split documents into chunks of 512 words. We use 512 here because this is the token limit of the model BAAI/bge-small-en-v1.5. Words and tokens are not the same unit, so treat this value as an approximation and update it based on the model you are using.
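As a rough illustration of what word-based splitting with overlap does, here is a minimal plain-Python sketch. The helper `split_by_word` is hypothetical and is not Haystack's `DocumentSplitter`; the real component may handle edge cases differently.

```python
def split_by_word(text, split_length=512, split_overlap=32):
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words.
    """
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")

    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# Hypothetical input: 1000 numbered words.
sample_text = " ".join(f"w{i}" for i in range(1000))
sample_chunks = split_by_word(sample_text)
```

With these settings each new chunk starts 480 words after the previous one, so the last 32 words of one chunk are repeated at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.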

In [27]:
from haystack.components.writers import DocumentWriter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack import Pipeline
from haystack.utils import ComponentDevice

document_splitter = DocumentSplitter(split_by="word", split_length=512, split_overlap=32 )
document_embedder = SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5", device=ComponentDevice.from_str("cuda:0")
)
document_writer = DocumentWriter(document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("document_splitter", document_splitter)
indexing_pipeline.add_component("document_embedder", document_embedder)
indexing_pipeline.add_component("document_writer", document_writer)

indexing_pipeline.connect("document_splitter", "document_embedder")
indexing_pipeline.connect("document_embedder", "document_writer")

indexing_pipeline.run({
    "document_splitter": {
        "documents": docs
    }
})

Batches: 100%|██████████| 481/481 [00:35<00:00, 13.40it/s]


{'document_writer': {'documents_written': 15380}}

#### Creating a Pipeline for Hybrid Retrieval

Hybrid retrieval refers to the combination of multiple retrieval methods to enhance overall performance.

Keyword Based Search + Dense Vector Search -> Rerank Results using cross-encoder model

#### -> Initialise Retrievers and the Embedder

For this example, the InMemoryEmbeddingRetriever and the InMemoryBM25Retriever are used to perform dense and keyword-based retrieval, respectively.

For dense retrieval, we need a SentenceTransformersTextEmbedder that computes the embedding of the search query with the same model we used for indexing the documents.
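The sketch below illustrates, with hand-written toy vectors, why the query and the documents must be embedded by the same model: retrieval compares them in a shared vector space. Cosine similarity is used here as an assumed similarity function; the function actually applied depends on how the document store is configured. The names `cosine_similarity` and `rank_by_embedding` and all vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_embedding(query_embedding, doc_embeddings, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings (assumption); real models use hundreds of dimensions.
toy_doc_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
toy_query_embedding = [0.9, 0.1, 0.0]
top_docs = rank_by_embedding(toy_query_embedding, toy_doc_embeddings)
```

If the query were embedded with a different model, its vector would live in a different space and these similarity scores would be meaningless.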

In [28]:
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_embedder = SentenceTransformersTextEmbedder(
    model="BAAI/bge-small-en-v1.5", device=ComponentDevice.from_str("cuda:0")
)
embedding_retriever = InMemoryEmbeddingRetriever(document_store)
bm25_retriever = InMemoryBM25Retriever(document_store)

#### -> Join Retrieval Results

Use a DocumentJoiner to combine the documents returned by the two retrievers. It supports joining methods such as `merge`, `reciprocal_rank_fusion` and `concatenate`. We will go with the default option, `concatenate`.
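For comparison with `concatenate`, here is a minimal sketch of the idea behind reciprocal rank fusion: each document receives a score based on its rank in every list, and the scores are summed. The function below is a generic textbook version, not the DocumentJoiner implementation; the constant `k=60` is a commonly used value and an assumption here, and Haystack may use a different constant or normalization.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document gets 1 / (k + rank) from every list it appears in.
    """
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical rankings from the two retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_d", "doc_a"]
fused_ranking = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
```

Documents that rank well in both lists rise to the top, which matters when no ranker follows the joiner. Because this pipeline rescores every candidate with a cross-encoder afterwards, simple concatenation is sufficient here.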

In [29]:
from haystack.components.joiners import DocumentJoiner

document_joiner = DocumentJoiner()

#### -> Rank the Results
Use the `TransformersSimilarityRanker`, which scores the relevance of every retrieved document for the given search query using a cross-encoder model. We will use the BAAI/bge-reranker-base model to rank the retrieved documents. You can replace it with another cross-encoder model from Hugging Face, for example one of those listed at https://huggingface.co/cross-encoder

In [30]:
from haystack.components.rankers import TransformersSimilarityRanker

ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

#### -> Create the Hybrid Retrieval Pipeline
Add all initialized components to the pipeline and connect them.

In [31]:
from haystack import Pipeline

hybrid_retrieval = Pipeline()
hybrid_retrieval.add_component("text_embedder", text_embedder)
hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
hybrid_retrieval.add_component("document_joiner", document_joiner)
hybrid_retrieval.add_component("ranker", ranker)

hybrid_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_retrieval.connect("bm25_retriever", "document_joiner")
hybrid_retrieval.connect("embedding_retriever", "document_joiner")
hybrid_retrieval.connect("document_joiner", "ranker")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7425795d01d0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: TransformersSimilarityRanker
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (List[float])
  - embedding_retriever.documents -> document_joiner.documents (List[Document])
  - bm25_retriever.documents -> document_joiner.documents (List[Document])
  - document_joiner.documents -> ranker.documents (List[Document])

#### -> Visualise the Pipeline (Optional)

Visualising the pipeline helps us understand how the hybrid retrieval pipeline is formed. Use the `draw` method to save a diagram of it to a file.

In [32]:
hybrid_retrieval.draw("hybrid-retrieval.png")

### Testing Hybrid Retrieval

In [33]:
query = "apnea in infants"

result = hybrid_retrieval.run({
    "text_embedder": {
        "text": query
    }, 
    "bm25_retriever": {
        "query": query
    },
    "ranker": {
        "query": query
    }
})

Batches: 100%|██████████| 1/1 [00:00<00:00, 106.58it/s]


##### Pretty Print Results

In [34]:
def pretty_print_results(prediction):
    for doc in prediction['documents']:
        print(doc.meta['title'], "\t", doc.score)
        print(doc.meta['abstract'])
        print('\n', '\n')
        
pretty_print_results(result['ranker'])

Physiologic changes induced by theophylline in the treatment of apnea in preterm infants. 	 0.9714500308036804
Ten preterm infants (birth weight 0.970 to 2.495 kg) with apnea due to periodic breathing (apneic interval = 5 to 10 seconds) or with "serious apnea" (greater than or equal to 20 seconds) were studied before and after the administration of theophylline. We determined the incidence of apnea, respiratory minute volume, alveolar gases, arterial gases and pH, "specific" compliance, functional residual capacity, and work of breathing. Theophylline decreased the incidence of apnea (P less than .05), increased respiratory minute volume (P less than 0.001), decreased (PACO2 (and PaCO2 P less than 0.001), increased the slope of the CO2 response curve (P less than 0.02) with a significant shift to the left (P less than 0.02). These findings suggest that the decreased incidence of apnea after theophylline is associated with an increase in alveolar ventilation and increased sensitivity to