# A Question Answering System backed by a Sparse Retriever

Sparse Retrievers are a family of algorithms that count word occurrences (bag-of-words). Each text becomes a very sparse vector whose length equals the vocabulary size.

**Examples:** BM25, TF-IDF

**Pros:** Simple, fast, and easy to interpret

**Cons:** Relies on exact keyword matches between query and text
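To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of BM25 scoring. It is an illustration under stated assumptions, not the implementation Haystack uses internally: the whitespace tokenizer, the IDF variant with `log(1 + ...)`, the values `k1=1.5` and `b=0.75`, and the toy sentences are all choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with a BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact keyword match, so no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Made-up toy sentences, only for illustration.
docs = [
    "the great archer drew his bow",
    "the old king ruled the city for many years",
    "an archer guards the gate",
]
scores = bm25_scores("great archer", docs)
for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

The second sentence shares no word with the query, so its score is zero. Querying for `"archers"` instead of `"archer"` scores zero for every sentence, which is the exact-match limitation listed under **Cons**.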

## Installing Haystack

To start, let's install the latest release of Haystack with `pip`.

**NOTE:** Skip this step if Haystack is already installed.

In [None]:
%%bash

pip install --upgrade pip
pip install greenlet

# For macOS on Apple silicon (M1):
# GRPC_PYTHON_BUILD_SYSTEM_ZLIB=true pip install farm-haystack

# Otherwise:
pip install farm-haystack

Set the logging level to INFO for Haystack and keep WARNING for all other loggers:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the DocumentStore

We'll start creating our question answering system by initializing a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, we're using the `InMemoryDocumentStore`, which is the simplest DocumentStore to get started with. It requires no external dependencies and is a good option for smaller projects and debugging. However, it doesn't scale well to larger Document collections, so it's not a good choice for production systems. To learn more about the DocumentStore and the different types of external databases that we support, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

Let's initialize the DocumentStore:

In [None]:
from haystack.document_stores import InMemoryDocumentStore

in_memory_document_store = InMemoryDocumentStore(use_bm25=True)

The DocumentStore is now ready. Next, let's fill it with some Documents.

## Preparing Documents

1. Download all 18 parvas of the Mahabharata from [Kaggle](https://www.kaggle.com/datasets/tilakd/mahabharata). Unzip the archive and place the .txt file in a folder named `data/Mahabharata` under the current working directory.

In [None]:
doc_dir = "data/Mahabharata"

2. Use `TextIndexingPipeline` to convert the files you just downloaded into Haystack [Document objects](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document) and write them into the DocumentStore:

In [None]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
from haystack.nodes import PreProcessor

# The PreProcessor defines the Document boundaries: it splits the text into
# passages of 500 words, with consecutive passages overlapping by 50 words.
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=500,
    split_overlap=50,
    split_respect_sentence_boundary=False,
)

files_to_index = [os.path.join(doc_dir, "1-18 books combined.txt")]
indexing_pipeline = TextIndexingPipeline(in_memory_document_store, preprocessor=preprocessor)
indexing_pipeline.run_batch(file_paths=files_to_index)
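To see what `split_by="word"`, `split_length=500`, and `split_overlap=50` mean in practice, the following sketch splits a text into overlapping word windows. It only approximates the idea; the helper name `split_by_words` is hypothetical, and the real `PreProcessor` may treat whitespace and boundaries differently.

```python
def split_by_words(text, split_length=500, split_overlap=50):
    """Split `text` into windows of `split_length` words that overlap by `split_overlap` words."""
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


# A synthetic 1000-word text: w0 w1 ... w999
text = " ".join(f"w{i}" for i in range(1000))
chunks = split_by_words(text)
print(len(chunks))                 # 3 windows, starting at words 0, 450, and 900
print(chunks[1].split()[0])        # w450
```

The overlap means a sentence that straddles a boundary still appears whole in at least one passage, at the cost of indexing some words twice.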

As an alternative, you can convert your text data into [Document objects](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document) and write them into the DocumentStore using `DocumentStore.write_documents()`.
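A minimal sketch of that alternative follows. It relies on `write_documents()` also accepting plain dictionaries with a `content` key and an optional `meta` key. The helper `to_document_dicts` and the meta fields `name` and `position` are hypothetical names chosen for this example.

```python
def to_document_dicts(texts, source_name):
    """Wrap raw strings in the dictionary layout that `write_documents()` accepts."""
    return [
        {"content": text, "meta": {"name": source_name, "position": i}}
        for i, text in enumerate(texts)
        if text.strip()  # skip empty or whitespace-only passages
    ]


# Placeholder passages, only to show the resulting shape.
docs = to_document_dicts(["First passage.", "   ", "Second passage."], "example.txt")
print(docs)

# With Haystack installed and the DocumentStore from this tutorial:
# in_memory_document_store.write_documents(docs)
```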

## Initializing the Retriever

Our search system will use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only the ones relevant to the question. This tutorial uses the BM25 algorithm. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

Let's initialize a BM25Retriever and make it use the InMemoryDocumentStore we initialized earlier:

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=in_memory_document_store)

The Retriever is ready, but we still need to initialize the Reader.

## Initializing the Reader

A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. In this tutorial, we're using a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a strong all-round model that's good as a starting point. To find the best model for your use case, see [Models](https://haystack.deepset.ai/pipeline_nodes/reader#models).

Let's initialize the Reader:

In [None]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

We've initialized all the components for our pipeline. We're now ready to create the pipeline.

## Creating the Retriever-Reader Pipeline

In this tutorial, we're using a ready-made pipeline called `ExtractiveQAPipeline`. It connects the Reader and the Retriever. The combination of the two speeds up processing because the Reader only processes the Documents that the Retriever has passed on. To learn more about pipelines, see [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines).
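The speed-up can be illustrated with a toy two-stage funnel. This is not how `ExtractiveQAPipeline` is implemented; `cheap_score`, `SlowReader`, and `retrieve_then_read` are hypothetical stand-ins that only show that the expensive stage runs on the few candidates the cheap stage lets through.

```python
import heapq


def cheap_score(query, doc):
    # Fast keyword overlap, standing in for the Retriever.
    return len(set(query.lower().split()) & set(doc.lower().split()))


class SlowReader:
    """Stand-in for an expensive model; it counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def score(self, query, doc):
        self.calls += 1
        return cheap_score(query, doc) / max(len(doc.split()), 1)


def retrieve_then_read(query, docs, reader, retriever_top_k, reader_top_k):
    # Stage 1: keep only the best `retriever_top_k` documents by the cheap score.
    candidates = heapq.nlargest(retriever_top_k, docs, key=lambda d: cheap_score(query, d))
    # Stage 2: run the expensive scorer on those candidates only.
    scored = [(reader.score(query, d), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:reader_top_k]]


# Synthetic corpus: 100 filler documents plus one relevant document.
docs = [f"filler text number {i}" for i in range(100)] + ["who is the king of the city"]
reader = SlowReader()
best = retrieve_then_read("who is the king", docs, reader, retriever_top_k=10, reader_top_k=2)
print(best[0])        # who is the king of the city
print(reader.calls)   # 10, although the corpus holds 101 documents
```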

To create the pipeline, run:

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

The pipeline is ready. You can now go ahead and ask a question!

## Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can use each node's `top_k` parameter to set how many Documents the Retriever returns and how many answers the Reader returns. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).

In [None]:
prediction = pipe.run(
    query="Who is Krishna?",
    #query="Who is called the divine son of Devaki?",
    #query="Who received Lord Shiva's boon?",
    #query="Who desired to perform the Rajasuya sacrifice?",
    #query="Who asked Ekalavya for his thumb?",
    #query="Why did Drona award the brahmastra weapon to Arjuna?",
    #query="Who had knowledge of Bhargav astra?",
    #query="Who was Bhishma?",
    #query="Who were the sons of Pandu?",
    #query="Who were the other maharathis that were supporting the Pandavas?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 2}
    }
)

2. Print out the answers the pipeline returned:

In [None]:
from pprint import pprint

pprint(prediction)

3. Simplify the printed answers:

In [None]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="medium",  # Choose from "minimum", "medium", and "all"
)
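If you want to post-process the answers yourself instead of printing them, a small helper along these lines may be useful. It assumes the prediction is a dictionary with an `answers` list whose items expose `answer` and `score` attributes, as Haystack's answer objects do. The helper name `summarize_answers` and the stand-in values below are made up for illustration.

```python
from types import SimpleNamespace


def summarize_answers(prediction, min_score=0.0):
    """Return (answer text, rounded score) pairs for answers scoring at least `min_score`."""
    rows = []
    for ans in prediction.get("answers", []):
        if ans.score is not None and ans.score >= min_score:
            rows.append((ans.answer, round(ans.score, 3)))
    return rows


# Stand-in objects with made-up values, only to show the expected shape.
fake_prediction = {
    "answers": [
        SimpleNamespace(answer="example answer", score=0.91),
        SimpleNamespace(answer="weak guess", score=0.12),
    ]
}
print(summarize_answers(fake_prediction, min_score=0.5))  # [('example answer', 0.91)]
```

In the notebook, you would pass the `prediction` returned by `pipe.run()` instead of `fake_prediction`.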

And there you have it! Congratulations on building your first machine-learning-based question answering system!

## Next Steps

Check out [Build a Scalable Question Answering System](https://haystack.deepset.ai/tutorials/03_scalable_qa_system) to learn how to make a more advanced question answering system that uses an Elasticsearch-backed DocumentStore and makes more use of the flexibility that pipelines offer.