<a href="https://colab.research.google.com/github/satishsingh-singh90/movie-recommender/blob/main/final_questions_and_answer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Build a Scalable Question Answering System

- **Level**: Beginner
- **Time to complete**: 20 minutes
- **Nodes Used**: `ElasticsearchDocumentStore`, `BM25Retriever`, `FARMReader`
- **Goal**: After completing this tutorial, you'll have built a scalable search system that runs on text files and can answer questions about Game of Thrones. You'll then be able to expand this system for your needs.


In [22]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Overview

Learn how to set up a question answering system that can search through complex knowledge bases and highlight answers to questions such as "Who is the father of Arya Stark?". In this tutorial, we'll work on a set of Wikipedia pages about Game of Thrones, but you can adapt it to search through internal wikis or a collection of financial reports, for example.

This tutorial introduces you to all the concepts needed to build such a question answering system. It also uses Haystack components, such as indexing pipelines, querying pipelines, and DocumentStores backed by external database services.

Let's learn how to build a question answering system and discover more about the marvelous seven kingdoms!


## Preparing the Colab Environment

- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)


## Installing Haystack

To start, let's install the latest release of Haystack with `pip`:

In [1]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab,preprocessing,elasticsearch,inference]




### Enabling Telemetry
Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details.

In [2]:
from haystack.telemetry import tutorial_running

tutorial_running(3)

Set the default logging level to WARNING and the level for Haystack loggers to INFO:

In [3]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the ElasticsearchDocumentStore

A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. Here, we're using the [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch) which connects to a running Elasticsearch service. It's a fast and scalable text-focused storage option. This service runs independently from Haystack and persists even after the Haystack program has finished running. To learn more about the DocumentStore and the different types of external databases that we support, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

1. Download, extract, and set the permissions for the Elasticsearch installation image:

In [4]:
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

2. Start the server:

In [5]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

If Docker is available in your environment (Colab notebooks do not support Docker), you can also start Elasticsearch using Docker. You can do this manually, or using our [`launch_es()`](https://docs.haystack.deepset.ai/reference/utils-api#module-doc_store) utility function.

In [None]:
# from haystack.utils import launch_es

# launch_es()

3. Wait 30 seconds for the server to fully start up:

In [6]:
import time

time.sleep(30)

4. Initialize the ElasticsearchDocumentStore:


In [7]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document")

ElasticsearchDocumentStore is up and running and ready to store the Documents.
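The fixed 30-second wait in step 3 can be too short on a slow machine and needlessly long on a fast one. As a minimal sketch, a polling loop could replace it. The helper names `wait_until_ready` and `elasticsearch_is_up` are hypothetical and not part of Haystack, and the address `http://localhost:9200` is assumed to be the default Elasticsearch endpoint.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(check: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call `check` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            # Connection refused or similar: the service is not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def elasticsearch_is_up(url: str = "http://localhost:9200") -> bool:
    """Return True if an HTTP GET on `url` answers with status 200 (assumed address)."""
    with urllib.request.urlopen(url, timeout=2) as response:
        return response.status == 200
```

In the notebook, `wait_until_ready(elasticsearch_is_up, timeout=120)` could then stand in for `time.sleep(30)`, returning `False` if the server never responds.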

## Indexing Documents with a Pipeline

The next step is adding the files to the DocumentStore. The indexing pipeline turns your files into Document objects and writes them to the DocumentStore. Our indexing pipeline will have two processing nodes: `TextConverter`, which turns `.txt` files into Haystack `Document` objects, and `PreProcessor`, which cleans and splits the text within a `Document`. The DocumentStore itself is added as the final node so that the processed Documents are written to it.

Once we combine these nodes into a pipeline, the pipeline will ingest `.txt` file paths, preprocess them, and write them into the DocumentStore.


1. Download 517 articles from the Game of Thrones Wikipedia. You can find them in *data/build_a_scalable_question_answering_system* as a set of *.txt* files.

In [30]:
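from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_a_scalable_question_answering_system"

# `url` is expected to be an HTTP(S) address of an archive. The value below is a
# local path, so nothing is downloaded here. As the log output shows, the fetch
# is skipped because `doc_dir` already contains data, and the call returns False.
fetch_archive_from_http(
    url="/content/data/build_a_scalable_question_answering_system/dpr_50_CUADv1.json",
    output_dir=doc_dir,
)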

INFO:haystack.utils.import_utils:Found data stored in 'data/build_a_scalable_question_answering_system'. Delete this first if you really want to fetch new data.


False

In [21]:
type(doc_dir)

str

2. Initialize the pipeline, TextConverter, and PreProcessor:

In [31]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

To learn more about the parameters of the `PreProcessor`, see [Usage](https://docs.haystack.deepset.ai/docs/preprocessor#usage). To understand why document splitting is important for your question answering system's performance, see [Document Length](https://docs.haystack.deepset.ai/docs/optimization#document-length).
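To make `split_length` and `split_overlap` more concrete, here is a simplified sketch of word-based splitting with overlap. It is not the `PreProcessor` implementation: the function name `split_by_words` is hypothetical, and the sketch ignores cleaning and sentence boundaries, which `split_respect_sentence_boundary=True` takes into account.

```python
from typing import List


def split_by_words(text: str, split_length: int = 200, split_overlap: int = 20) -> List[str]:
    """Split `text` into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words, so an answer that
    straddles a chunk border still appears whole in one of the chunks.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            # The current chunk already reaches the end of the text.
            break
    return chunks
```

With the values used above, each chunk starts 180 words after the previous one, so the last 20 words of one chunk are repeated at the start of the next.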

3. Add the nodes to the indexing pipeline. For each node, pass the name or names of the preceding nodes as the `inputs` argument. Note that in an indexing pipeline, the input to the first node is `File`.

In [32]:
import os

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

4. Run the indexing pipeline to write the text data into the DocumentStore:

In [33]:
files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)

INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.
Converting files: 100%|██████████| 184/184 [00:08<00:00, 20.62it/s]
Preprocessing: 100%|██████████| 184/184 [00:11<00:00, 15.52docs/s]


{'documents': [<Document: {'content': "\n\n'''Myrcella Baratheon''' is a fictional character in the ''A Song of Ice and Fire'' series of epic fantasy novels by American author George R. R. Martin, and its television adaptation ''Game of Thrones''.  Myrcella's character, development and her interactions and impact differ greatly between the two genres.\n\nIntroduced in 1996's ''A Game of Thrones'', Myrcella is the only daughter of Cersei Lannister from the kingdom of Westeros. She subsequently appeared in Martin's ''A Clash of Kings'' (1998) and ''A Feast for Crows'' (2005).\n\nMyrcella is portrayed by Irish actress Aimee Richardson in the first two seasons of the HBO television adaptation, while English actress Nell Tiger Free portrays her in the show's fifth and sixth seasons.\n\n==Character==\nSince Myrcella Baratheon is not a point of view character in ''A Song of Ice and Fire'', the reader learns about her through other characters' perspectives, such as her uncle Tyrion Lannister. 

The code in this tutorial uses Game of Thrones data, but you can also supply your own `.txt` files and index them in the same way.

As an alternative, you can cast your text data into [Document objects](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document) and write them into the DocumentStore using [`DocumentStore.write_documents()`](https://docs.haystack.deepset.ai/reference/document-store-api#basedocumentstorewrite_documents).
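A rough sketch of that alternative: the helper below reads every `.txt` file in a folder into a dictionary with a `content` key and a `meta` key, which is the dictionary format `write_documents()` accepts in Haystack 1.x. The helper name `load_txt_as_dicts` and the `name` metadata field are illustrative choices, not part of the tutorial.

```python
from pathlib import Path
from typing import Dict, List


def load_txt_as_dicts(directory: str) -> List[Dict]:
    """Read all .txt files in `directory` into document dictionaries."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        docs.append(
            {
                "content": path.read_text(encoding="utf-8"),
                "meta": {"name": path.name},  # illustrative metadata field
            }
        )
    return docs


# In the notebook, assuming `doc_dir` and `document_store` from the cells above:
# document_store.write_documents(load_txt_as_dicts(doc_dir))
```

Note that this route bypasses the `PreProcessor`, so long files would be stored unsplit unless you split them yourself first.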

Now that the Documents are in the DocumentStore, let's initialize the nodes we want to use in our query pipeline.

## Initializing the Retriever

Our query pipeline is going to use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only those that are relevant to the question. This tutorial uses the BM25Retriever. This is the recommended Retriever for a question answering system like the one we're creating. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

In [41]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

The BM25Retriever is initialized and ready for the pipeline.
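For intuition about what BM25 ranking does, here is a small, self-contained sketch of the textbook BM25 formula. It is not how the BM25Retriever works internally, since the actual scoring happens inside Elasticsearch. The function name `bm25_scores`, the whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, documents: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with a textbook BM25 formula."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no term with the query score zero, which is why keyword-based retrieval is fast but can miss passages that phrase the same idea with different words.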

## Initializing the Reader

Our query pipeline also needs a Reader, so we'll initialize it next. A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. The tutorial was written for a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2), which is a good all-round model to start with. The cell below instead loads a DeBERTa-v3-base checkpoint from a mounted Google Drive folder. Note that its log output reports zero GPUs, so the Reader runs on CPU here even though `use_gpu=True` is set. To find a model that's best for your use case, see [Models](https://docs.haystack.deepset.ai/docs/reader#models).

In [42]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="/content/drive/MyDrive/deberta v3 base with 100/deberta-v3-base", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.model.language_model: * LOADING MODEL: '/content/drive/MyDrive/deberta v3 base with 100/deberta-v3-base' (DebertaV2)
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded '/content/drive/MyDrive/deberta v3 base with 100/deberta-v3-base' (DebertaV2 model) from model hub.
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


## Creating the Retriever-Reader Pipeline

You can combine the Reader and Retriever in a querying pipeline using the `Pipeline` class. The combination of the two speeds up processing because the Reader only processes the Documents that it received from the Retriever.

Initialize the `Pipeline` object and add the Retriever and Reader as nodes. For each node, pass the name or names of the preceding nodes as the `inputs` argument. Note that in a querying pipeline, the input to the first node is `Query`.

In [43]:
from haystack import Pipeline

querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

That's it! Your pipeline's ready to answer your questions!
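To illustrate why putting a Retriever in front of a Reader saves time, here is a toy two-stage sketch in plain Python. It does not use Haystack: `retrieve`, `answer`, and the `reader` callable are hypothetical stand-ins, and the word-overlap ranking is a deliberately crude replacement for BM25.

```python
from typing import Callable, List, Tuple


def retrieve(query: str, documents: List[str], top_k: int) -> List[str]:
    """Cheap first stage: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def answer(
    query: str,
    documents: List[str],
    reader: Callable[[str, str], Tuple[str, float]],
    retriever_top_k: int = 10,
    reader_top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Run the expensive `reader` only on the retrieved candidates."""
    candidates = retrieve(query, documents, retriever_top_k)
    # `reader` returns an (answer, score) pair for one document.
    scored = [reader(query, doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:reader_top_k]
```

However many documents are stored, the reader is called at most `retriever_top_k` times per query, which is the trade-off the two `top_k` values control.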

## Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can set how many Documents the Retriever passes on and how many answers the Reader returns using each node's `top_k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).


In [44]:
prediction = querying_pipeline.run(
    query='Highlight the parts (if any) of this contract related to "Document Name" '
    "that should be reviewed by a lawyer. Details: The name of the contract",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)

Inferencing Samples: 100%|██████████| 1/1 [00:32<00:00, 32.03s/ Batches]


Here are some questions you could try out:
- Who is the father of Arya Stark?
- Who created the Dothraki vocabulary?
- Who is the sister of Sansa?

2. Print out the answers the pipeline returns:

In [45]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': 'COMMERCIALIZATION AND LICENSE AGREEMENT', 'type': 'extractive', 'score': 0.9106268882751465, 'context': 'wyer. Details: The name of the contract",\n"answers": [\n"COMMERCIALIZATION AND LICENSE AGREEMENT"\n],\n"positive_ctxs": [\n{\n"title": "CytodynInc_20200109', 'offsets_in_document': [{'start': 613, 'end': 652}], 'offsets_in_context': [{'start': 56, 'end': 95}], 'document_ids': ['6f303b734e8e1c939c11293b6132c186'], 'meta': {'_split_id': 10775, '_split_overlap': [{'range': [0, 167], 'doc_id': '8eed0b41547433dac95a9ea152b13435'}, {'range': [563, 1254], 'doc_id': 'c1ec0d0a65d676615b3395d51263f233'}]}}>,
             <Answer {'answer': 'Intellectual Property Agreement', 'type': 'extractive', 'score': 0.8956485986709595, 'context': 'XECUTION VERSION\\n\\nINTELLECTUAL PROPERTY AGREEMENT\\n\\nThis Intellectual Property Agreement (the \\"Agreement\\"), is entered into as of November 20, 20', 'offsets_in_document': [{'start': 526, 'end': 557}], 'offsets_in_con

3. Simplify the printed answers:

In [46]:
from haystack.utils import print_answers

print_answers(prediction, details="minimum")  # Choose from `minimum`, `medium`, and `all`

('Query: Highlight the parts (if any) of this contract related to "Document '
 'Name" that should be reviewed by a lawyer. Details: The name of the contract')
'Answers:'
[   {   'answer': 'COMMERCIALIZATION AND LICENSE AGREEMENT',
        'context': 'wyer. Details: The name of the contract",\n'
                   '"answers": [\n'
                   '"COMMERCIALIZATION AND LICENSE AGREEMENT"\n'
                   '],\n'
                   '"positive_ctxs": [\n'
                   '{\n'
                   '"title": "CytodynInc_20200109'},
    {   'answer': 'Intellectual Property Agreement',
        'context': 'XECUTION VERSION\\n\\nINTELLECTUAL PROPERTY '
                   'AGREEMENT\\n\\nThis Intellectual Property Agreement (the '
                   '\\"Agreement\\"), is entered into as of November 20, 20'},
    {   'answer': 'Promotion Agreement"\n'
                  '],\n'
                  '"positive_ctxs": [\n'
                  '{\n'
                  '"title": "CYBERIANOUTPOSTINC

And there you have it! Congratulations on building a scalable, machine-learning-based question answering system!
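If you only want to keep high-confidence answers, you could filter them by score. The sketch below uses a hypothetical `ToyAnswer` dataclass as a stand-in for Haystack's `Answer` objects, so it runs without Haystack. The first two scores are copied from the output above, the third entry is invented, and the 0.5 threshold is an arbitrary illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyAnswer:
    """Hypothetical stand-in for a Haystack Answer; only the fields used here."""
    answer: str
    score: float


def confident_answers(answers: List[ToyAnswer], min_score: float = 0.5) -> List[str]:
    """Return the answer strings whose score is at least `min_score`."""
    return [a.answer for a in answers if a.score >= min_score]


example = [
    ToyAnswer("COMMERCIALIZATION AND LICENSE AGREEMENT", 0.9106268882751465),
    ToyAnswer("Intellectual Property Agreement", 0.8956485986709595),
    ToyAnswer("made-up low-confidence answer", 0.12),  # invented for illustration
]
print(confident_answers(example))
```

With the real pipeline output, the same function could be applied to `prediction["answers"]`, since each `Answer` exposes `answer` and `score` attributes.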

## Next Steps

To learn how to improve the performance of the Reader, see [Fine-Tune a Reader](https://haystack.deepset.ai/tutorials/02_finetune_a_model_on_your_data).

In [47]:
# This is an example Streamlit app saved as a Jupyter notebook
# Upload this notebook to Google Colab and run it to see your app in action

!pip install streamlit

import streamlit as st

def main():
    st.title("Hello, Streamlit!")
    st.write("This is a Streamlit app running on Google Colab.")

if __name__ == "__main__":
    main()



2024-04-01 16:31:20.006 
  command:

    streamlit run /usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py [ARGUMENTS]
