# A Question Answering System backed by a Dense Retriever

Dense Retrievers use neural network models to create “dense” embedding vectors. Within this family, there are two different approaches:
  
  a) Single encoder: Use a single model to embed both the query and the passage.  
  b) Dual-encoder: Use two models, one to embed the query and one to embed the passage.  

**Examples:** REALM, DPR, Sentence-Transformers

**Pros:** Captures semantic similarity instead of “word matches” (for example, synonyms, related topics).

**Cons:** Computationally heavier to run, and the model must be trained first (though this is less of an issue nowadays, as many pre-trained models are available and training your own is rarely needed).

## A note on Embedding Retrieval

We are going to use an EmbeddingRetriever with Sentence Transformers models.

These models are trained to embed similar sentences close to each other in a shared embedding space.

Some models have been fine-tuned on massive Information Retrieval data and can be used to retrieve documents based on a short query (for example, multi-qa-mpnet-base-dot-v1). There are others that are more suited to semantic similarity tasks where you are trying to find the most similar documents to a given document (for example, all-mpnet-base-v2). There are even models that are multilingual (for example, paraphrase-multilingual-mpnet-base-v2). 
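
The `dot` in `multi-qa-mpnet-base-dot-v1` refers to dot-product scoring, whereas semantic similarity models are commonly compared with cosine similarity. The two scores can rank passages differently when vector lengths vary. The sketch below illustrates this with hand-made 2-d vectors; the values and helper names are assumptions for illustration, not real model output.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalised vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def rank(query, passages, score):
    """Return passage names ordered from best to worst under `score`."""
    return sorted(passages, key=lambda name: score(query, passages[name]), reverse=True)

# Toy "embeddings" (assumed values, for illustration only).
query = [1.0, 0.0]
passages = {
    "long_vector": [3.0, 3.0],     # large magnitude, 45 degrees off the query
    "aligned_vector": [2.0, 0.2],  # smaller magnitude, almost parallel to the query
}

rank_by_dot = rank(query, passages, dot)
rank_by_cosine = rank(query, passages, cosine)
print("dot product:", rank_by_dot)
print("cosine     :", rank_by_cosine)
```

Dot product rewards both direction and magnitude, while cosine looks at direction only, so it is worth matching the similarity function to the one the chosen model was trained with.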

## Installing Haystack

To start, let's install the latest release of Haystack with `pip`.  
**NOTE:** Skip this step if Haystack is already installed.

In [None]:
%%bash

pip install --upgrade pip
pip install greenlet
GRPC_PYTHON_BUILD_SYSTEM_ZLIB=true pip install farm-haystack

Set the logging level to WARNING globally and to INFO for Haystack:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Check if FAISS package is available

As we are going to use FAISS (Facebook AI Similarity Search), we need to ensure its packages are available.  
Read more about [FAISS](https://ai.facebook.com/tools/faiss/).


In [None]:
%%bash
pip list | grep faiss 

The above cell will give an error if FAISS is not installed. If so, install it by executing the next cell.

In [None]:
%%bash
pip install faiss-cpu
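
As an alternative to grepping the output of `pip list`, the same check can be made from Python. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and `faiss` is assumed to be the import name provided by the `faiss-cpu` package.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# `faiss` is the module name we expect `faiss-cpu` to provide.
print("faiss available:", is_installed("faiss"))
```

Unlike the `grep` approach, this reports a missing package as `False` instead of failing the cell.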

## Check if SQLAlchemy is the correct version

There is an open issue where dependency resolution pulls in an incompatible version of SQLAlchemy. Check that version `1.4.47` is installed; if it is not, force-reinstall it.

In [None]:
%%bash
pip list | grep SQLAlchemy 

The above cell lists the version of the currently installed SQLAlchemy package. If it's not `1.4.47`, force-reinstall it by executing the next cell.

In [None]:
%%bash
pip install --force-reinstall -v "SQLAlchemy==1.4.47"
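
The version comparison can also be done programmatically. The following is a sketch under the assumption that the package is registered under the distribution name `SQLAlchemy`; `installed_version` and `needs_reinstall` are hypothetical helper names.

```python
from importlib import metadata

REQUIRED_VERSION = "1.4.47"  # the version named in the text above

def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def needs_reinstall(found, required: str = REQUIRED_VERSION) -> bool:
    """True when the package is missing or its version differs from `required`."""
    return found != required

found = installed_version("SQLAlchemy")
print("SQLAlchemy:", found, "| reinstall needed:", needs_reinstall(found))
```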

## Initializing the DocumentStore

We'll start creating our question answering system by initializing a [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).


### Create a new document store

If you have never run this notebook earlier, create a new document store. 

**NOTE:** If you already have a document store and a FAISS index, skip this step and the ones that follow, and jump directly to the section _Re-run with an existing document store_.

In [None]:
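import faiss  # not used directly; importing it fails early if FAISS is missing
from haystack.document_stores import FAISSDocumentStore

# SQLite file that stores the document text and metadata
sql_url = "sqlite:///mahabharata.db"

# "Flat" builds an exhaustive (brute-force) FAISS index
sqlite_document_store = FAISSDocumentStore(
    sql_url=sql_url, faiss_index_factory_str="Flat", return_embedding=True
)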

### Preparing Documents

1. Download all 18 parvas of the Mahabharata from https://www.kaggle.com/datasets/tilakd/mahabharata and unzip the archive. Place the .txt file in a folder named `data/Mahabharata` under the current working directory.

In [None]:
doc_dir = "data/Mahabharata"

2. Convert the raw text file to documents

In [None]:
from haystack.utils import clean_wiki_text, convert_files_to_docs

files_to_docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

3. Use a PreProcessor to split the raw documents into multiple smaller documents with the given configuration.

    *Note from Haystack:* Dense Retrievers are limited in the length of text that they can read in one pass. As such, it is important that Documents are not longer than the dense Retriever's maximum input length. By default, Haystack's DensePassageRetriever model has a maximum length of 256 tokens. As such, we recommend that Documents contain significantly fewer words. We have found decent performance with Documents around 100 words long.

In [None]:
from haystack.nodes import PreProcessor

# Use PreProcessor to create the document boundaries
processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
    split_overlap=0
)

docs = processor.process(files_to_docs)
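
To build intuition for what `split_by="word"`, `split_length` and `split_respect_sentence_boundary=True` are meant to achieve, here is a simplified pure-Python sketch. It is not Haystack's implementation: the sentence splitter is a naive regular expression, `split_by_words` is a hypothetical name, and the sample text is invented for the example.

```python
import re

def split_by_words(text: str, split_length: int = 100):
    """Greedily pack whole sentences into chunks of at most `split_length` words.

    A single sentence longer than `split_length` is kept whole in its own chunk.
    """
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and current_len + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

# Invented sample text, with a small limit so the split is easy to see.
sample = "Arjuna took up his bow. Krishna drove the chariot. The conches sounded across the field."
chunks = split_by_words(sample, split_length=10)
for chunk in chunks:
    print(len(chunk.split()), "words:", chunk)
```

Keeping sentences intact means chunks are usually a little shorter than the limit, which leaves headroom below the retriever's maximum input length.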

4. Save the processed documents to document store.

In [None]:
sqlite_document_store.write_documents(docs)

5. Initialize an EmbeddingRetriever

In [None]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=sqlite_document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)


6. Call `update_embeddings()` to iterate over the previously indexed documents and compute their embedding representation.

In [None]:
# Important:
# Now that we initialized the Retriever, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time consuming operation (depending on the corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing document embeddings, which is very fast.
sqlite_document_store.update_embeddings(retriever)
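
The split between indexing time and query time described in the comments above can be sketched in a few lines. A `Flat` FAISS index compares the query against every stored vector, so conceptually the lookup resembles the brute-force search below. The vectors are toy values and `top_k` is a hypothetical helper, not a Haystack or FAISS function.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, doc_vecs, k=2):
    """Score every stored vector against the query and return the k best ids."""
    scored = sorted(doc_vecs.items(), key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Computed once, at indexing time (toy values standing in for real embeddings).
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.7, 0.2, 0.1],
}

# Computed for each incoming query (toy value).
query_vec = [1.0, 0.0, 0.0]

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```

Only the query vector changes between requests; the document vectors are reused, which is why the expensive embedding pass has to run only once.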

7. Save the updated FAISS index

In [None]:

# Save the FAISS index
sqlite_document_store.save("mahabharata_faiss")

### Re-run with an existing document store

If you ran the notebook earlier and already have the SQLite database with corresponding embeddings available, load it and initialize the EmbeddingRetriever. If you just created the embeddings, skip this step as the document store and EmbeddingRetriever are already loaded and ready to go.

In [None]:
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever

# Load the previously saved FAISS index together with its config file
sqlite_document_store = FAISSDocumentStore.load(index_path="mahabharata_faiss", config_path="mahabharata_faiss.json")
assert sqlite_document_store.faiss_index_factory_str == "Flat"

retriever = EmbeddingRetriever(
    document_store=sqlite_document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)
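
To decide whether to load an existing store or create a new one, it can help to check for the saved files first. This is a minimal sketch that assumes the index was saved as `mahabharata_faiss` with its config next to it as `mahabharata_faiss.json` (the paths used in this notebook); `saved_index_present` is a hypothetical helper name.

```python
from pathlib import Path

def saved_index_present(index_path: str) -> bool:
    """True if both the index file and its sibling .json config exist on disk."""
    index_file = Path(index_path)
    config_file = index_file.with_suffix(".json")
    return index_file.is_file() and config_file.is_file()

print("existing index found:", saved_index_present("mahabharata_faiss"))
```

The SQLite file (`mahabharata.db`) needs to be present as well, since the FAISS index holds only the vectors while the document text lives in the database.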


## Initializing the Reader

Here we use a FARMReader with the `deepset/roberta-base-squad2` model.

In [None]:
from haystack.nodes import FARMReader


# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

## Creating the Retriever-Reader Pipeline

In this tutorial, we're using a ready-made pipeline called `ExtractiveQAPipeline`. It connects the Reader and the Retriever. The combination of the two speeds up processing because the Reader only processes the Documents that the Retriever has passed on. To learn more about pipelines, see [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines).

To create the pipeline, run:

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)


## Ask a question!

In [None]:
# You can configure how many candidates the Retriever and the Reader return.
# A higher top_k for the Retriever tends to give better answers, but is slower.
prediction = pipe.run(
    query="Who is Krishna?", 
    #query="Who is called the divine son of Devaki?", 
    #query="Who received Lord Shiva's boon?",
    #query="Who desired to perform the Rajasuya sacrifice?",
    #query="Who asked Ekalavya for his thumb?",
    #query="Why did Drona award the brahmastra weapon to Arjuna?",
    #query="Who learned about bhargav astra?",
    #query="Who is the daughter of the sage Gautama?",
    params={"Retriever": {"top_k": 10}, 
            "Reader": {"top_k": 2}}
)


## Print the answers

In [None]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="medium",  # Choose from `minimum`, `medium`, and `all`
)
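
Besides printing, the result can be processed programmatically. The sketch below assumes the pipeline result is a dictionary whose `"answers"` entry is a list of objects exposing `answer` and `score` attributes. `best_answers` is a hypothetical helper, and the stand-in data is made up so the example runs without a model.

```python
from types import SimpleNamespace

def best_answers(prediction, min_score=0.5):
    """Return (answer_text, score) pairs whose score is at least `min_score`."""
    return [
        (ans.answer, ans.score)
        for ans in prediction.get("answers", [])
        if ans.score is not None and ans.score >= min_score
    ]

# Stand-in for a real pipeline result (made-up values, not actual model output).
fake_prediction = {
    "answers": [
        SimpleNamespace(answer="example answer one", score=0.91),
        SimpleNamespace(answer="example answer two", score=0.12),
    ]
}

kept = best_answers(fake_prediction, min_score=0.5)
print(kept)
```

With a real run, passing `prediction` instead of `fake_prediction` should work as long as the assumption about the result structure holds for the installed Haystack version.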