# A question and answer system

This notebook uses Haystack's `InMemoryDocumentStore`, `BM25Retriever`, and `FARMReader` to create a question and answer system from a corpus of documents.

The corpus is a selection of texts from Project Gutenberg, all by Charles Dickens. The works included in the corpus are listed in the Uploading Documents section below.


## Overview

The pipeline combines a Haystack DocumentStore, Retriever, and Reader to search the Dickens texts and answer questions such as "Who is Bob Cratchit's employer?" and "What are the two cities in A Tale of Two Cities?"



## Environment

The models used for the question and answer session rely on a GPU. I used Google Colab, since notebooks there can be configured to use a GPU. This notebook was originally written in Google Colab.

(In Google Colab, to enable the GPU, select Change Runtime Type from the Runtime menu, and then choose Hardware Accelerator > GPU.)
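Before loading any models, it can help to confirm that the runtime actually sees a GPU. The following is a minimal sketch, assuming PyTorch is the backend in use (it is preinstalled on Colab); `gpu_available` is a hypothetical helper name, not part of Haystack.

```python
def gpu_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # assumption: PyTorch is the deep learning backend
    except ImportError:
        return False
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` in Colab, the runtime type is probably still set to CPU.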

The model also requires installing some modules via pip as follows:

### Installing Haystack



In [1]:
%%bash

pip install --upgrade pip
pip install "farm-haystack[colab,inference]"

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 21.3 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.0
Collecting farm-haystack[colab,inference]
  Downloading farm_haystack-1.24.1-py3-none-any.whl.metadata (27 kB)



We also set the logging level for Haystack to INFO, so that we can see more of what is going on:

In [4]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Document repository: DocumentStore

To avoid remote storage requirements, we will use an in-memory DocumentStore. A live production system would call for a disk-based document store, but an in-memory store is sufficient for this demonstration.

We'll initialize the store here:

In [2]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

## Uploading Documents

I downloaded 37 documents from Project Gutenberg, all freely available texts of the following works by Charles Dickens:

*   A Child's Dream of a Star
*   A Child's History of England
*   A Christmas Carol in Prose; Being a Ghost Story of Christmas
*   A Tale of Two Cities
*   American Notes
*   Barnaby Rudge: A Tale of the Riots of 'Eighty
*   Bleak House
*   David Copperfield
*   Dombey and Son
*   Great Expectations
*   Hard Times
*   Hunted Down: The Detective Stories of Charles Dickens
*   Little Dorrit
*   Martin Chuzzlewit
*   Master Humphrey's Clock
*   Mugby Junction
*   Nicholas Nickleby
*   Oliver Twist
*   Our Mutual Friend
*   Pictures from Italy
*   Sketches by Boz, Illustrative of Every-Day Life and Every-Day People
*   Some Christmas Stories
*   Speeches: Literary and Social
*   The Battle of Life: A Love Story
*   The Chimes
*   The Cricket on the Hearth
*   The Cricket on the Hearth: A Fairy Tale of Home
*   The Haunted Man and the Ghost's Bargain
*   The Holly-Tree
*   The Magic Fishbone
*   The Mystery of Edwin Drood
*   The Old Curiosity Shop
*   The Pickwick Papers
*   The Seven Poor Travellers
*   The Uncommercial Traveller
*   Three Ghost Stories
*   To Be Read at Dusk

I downloaded each file from [Project Gutenberg](https://www.gutenberg.org/) to a local disk, then zipped the files and uploaded the archive to a personal AWS S3 bucket.



In [3]:
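import os
import shutil

from haystack.utils import fetch_archive_from_http

doc_dir = "data/qa_system"

# Remove any earlier download so the archive is fetched fresh.
shutil.rmtree(doc_dir, ignore_errors=True)

# Download the zipped corpus and unpack it into doc_dir.
fetch_archive_from_http(
    url="https://scelerat-gutenberg-downloads.s3.amazonaws.com/dickens.zip",
    output_dir=doc_dir,
)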

True

## Indexing documents

We will take the files we uploaded and use `TextIndexingPipeline` to convert them into Haystack [Document objects](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document), writing them into the DocumentStore.

In [4]:

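from haystack.pipelines.standard_pipelines import TextIndexingPipeline

# Collect the path of every extracted file in the dickens/ folder.
corpus_dir = os.path.join(doc_dir, "dickens")
files_to_index = [os.path.join(corpus_dir, f) for f in os.listdir(corpus_dir)]

# Convert, preprocess, and write the files into the DocumentStore.
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)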

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Converting files: 100%|██████████| 37/37 [00:15<00:00,  2.45it/s]
Preprocessing: 100%|██████████| 37/37 [00:18<00:00,  2.02docs/s]
Updating BM25 representation...: 100%|██████████| 28111/28111 [00:03<00:00, 7501.97 docs/s]


{'documents': [<Document: {'content': '\ufeffThe Project Gutenberg eBook of The Seven Poor Travellers\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: The Seven Poor Travellers\n\nAuthor: Charles Dickens\n\nRelease date: July 1, 1998 [eBook #1392]\nMost recently updated: December 31, 2020\n\nLanguage: English\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE SEVEN POOR TRAVELLERS ***\n\nTranscribed from the 1894 Chapman and Hall edition of "Christmas Stories"\nby David Price, email ccx074@coventry.ac.uk\n\nTHE SEVEN POOR TRAVELLERS--IN THREE CHAPTERS\n\nCHAPTER I--IN THE OLD CITY OF

## Retriever

Next we initialize the Retriever, which narrows the DocumentStore down to the Documents most relevant to the question. We use the [BM25 Retriever](https://docs.haystack.deepset.ai/docs/retriever#bm25-recommended), a general-purpose keyword-based retriever that does not rely on a neural network for indexing.
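To give a feel for what BM25 does, here is a simplified pure-Python sketch of the scoring formula. It is not Haystack's implementation: the whitespace tokenizer, the `bm25_scores` name, the toy documents, and the `k1` and `b` defaults are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates and is normalized by document length.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


# Toy documents (hypothetical, not taken from the corpus).
docs = [
    "scrooge was the employer of bob cratchit",
    "it was the best of times it was the worst of times",
    "oliver twist asked for more",
]
scores = bm25_scores("cratchit employer", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first toy document shares any terms with the query, so it is the only one with a non-zero score. No model is involved, which is why this stage is cheap.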

The retriever is initialized by passing in the document store we created above.

In [5]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

Now that the Retriever is initialized, we need to prepare the Reader.

---



## Reader

The Reader is the next step in the process. It receives the texts selected by the Retriever and scans them for answer candidates. The Reader is more accurate than the Retriever but also much slower, which is why the pipeline is built in two phases: the Retriever cheaply narrows the corpus, and the Reader examines only what remains.
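The two-phase idea can be sketched without any models. In the toy example below, `retrieve` and `read` are hypothetical stand-ins (simple word overlap and a call counter), not Haystack components; the point is only that the expensive step runs on `top_k` candidates rather than on every passage.

```python
def retrieve(query, passages, top_k):
    """Cheap first stage: rank passages by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


calls = {"reader": 0}


def read(query, passage):
    """Stand-in for the expensive second stage; it only counts invocations."""
    calls["reader"] += 1
    return passage  # a real reader would extract an answer span here


# 1000 irrelevant passages plus one relevant one (all hypothetical).
passages = [f"filler passage number {i}" for i in range(1000)]
passages.append("bob cratchit worked for scrooge")

query = "who did cratchit work for"
candidates = retrieve(query, passages, top_k=10)
results = [read(query, p) for p in candidates]
print(calls["reader"], candidates[0])
```

The slow stage runs 10 times instead of 1001, and the relevant passage still reaches it because the cheap filter ranked it first.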

The [RoBERTa model](https://huggingface.co/deepset/roberta-base-squad2) used here has been trained on question-answer pairs as well as unanswerable questions.


In [6]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)


## Pipeline



In [7]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

## Asking a Question

To ask a question we use the `run()` method on the pipeline.

In [26]:
prediction = pipe.run(
    query="Who is Bob Cratchit's employer?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)

Inferencing Samples: 100%|██████████| 1/1 [00:23<00:00, 23.78s/ Batches]


In [27]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': 'Scrooge', 'type': 'extractive', 'score': 0.5262583494186401, 'context': '"Now, I\'ll tell you what, my friend," said Scrooge, "I\nam not going to stand this sort of thing any longer. And\ntherefore," he continued, leaping from', 'offsets_in_document': [{'start': 43, 'end': 50}], 'offsets_in_context': [{'start': 43, 'end': 50}], 'document_ids': ['ec6b4b4a3cb36adcaca6791c40e6e004'], 'meta': {'_split_id': 152}}>,
             <Answer {'answer': 'Master Peter', 'type': 'extractive', 'score': 0.24550631642341614, 'context': 'inda Cratchit, second of\nher daughters, also brave in ribbons; while Master Peter\nCratchit plunged a fork into the saucepan of potatoes, and\ngetting t', 'offsets_in_document': [{'start': 275, 'end': 287}], 'offsets_in_context': [{'start': 69, 'end': 81}], 'document_ids': ['958a7c0c8786d50cd1c13cc0243e20ad'], 'meta': {'_split_id': 83}}>,
             <Answer {'answer': 'Mr. Dickens', 'type': 'extractive', 'score': 0.2353971302509308, '

If we want a single answer, we do the following:

In [30]:
print(prediction['answers'][0].to_dict()['answer'])

Scrooge
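Taking the first answer unconditionally assumes the Reader always returns something trustworthy. A slightly more defensive sketch is shown below. It assumes each answer exposes `.answer` and `.score` attributes, as the printed `Answer` objects above suggest; `best_answer` and the `min_score` threshold of 0.3 are hypothetical choices, and the `SimpleNamespace` objects are stand-ins that mirror the first three answers printed above (scores rounded).

```python
from types import SimpleNamespace


def best_answer(answers, min_score=0.3):
    """Return the highest-scoring answer text, or None if none reaches min_score."""
    confident = [a for a in answers if a.score is not None and a.score >= min_score]
    if not confident:
        return None
    return max(confident, key=lambda a: a.score).answer


# Stand-ins for the Answer objects in prediction["answers"].
answers = [
    SimpleNamespace(answer="Scrooge", score=0.526),
    SimpleNamespace(answer="Master Peter", score=0.246),
    SimpleNamespace(answer="Mr. Dickens", score=0.235),
]
print(best_answer(answers))
print(best_answer(answers, min_score=0.9))
```

With a real prediction, the call would be `best_answer(prediction["answers"])`. Returning `None` makes it explicit when no answer is confident enough to show.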


## Print the Answers

In [9]:
from haystack.utils import print_answers

print_answers(prediction, details="minimum")  # Choose from `minimum`, `medium`, and `all`

"Query: Who is Bob Cratchit's employer?"
'Answers:'
[   {   'answer': 'Scrooge',
        'context': '"Now, I\'ll tell you what, my friend," said Scrooge, "I\n'
                   'am not going to stand this sort of thing any longer. And\n'
                   'therefore," he continued, leaping from'},
    {   'answer': 'Master Peter',
        'context': 'inda Cratchit, second of\n'
                   'her daughters, also brave in ribbons; while Master Peter\n'
                   'Cratchit plunged a fork into the saucepan of potatoes, '
                   'and\n'
                   'getting t'},
    {   'answer': 'Mr. Dickens',
        'context': '_ a goose, Martha!” can never be forgotten.  By\n'
                   'some conjuring trick, Mr. Dickens takes off his own head '
                   'and puts on a\n'
                   'Cratchit’s.  Later Bob Cratchit'},
    {   'answer': 'Scrooge',
        'context': 'e court for help\n'
                   'and a strait-waistcoat.\n'
        

## Answering multiple questions

The `top_k` parameters control how many documents the Retriever passes on and how many answers the Reader returns.

Now we will ask a number of questions to see if the model works:






In [32]:
questions = [
    "Who is Bob Cratchit's employer?", # Scrooge
    "How many ghosts visit Scrooge?", # Trick question
    "Who is the father of Tiny Tim?", # Bob Cratchit
    "Where does Mr. Lorry wish to go?", # France
]

In [33]:
answers = []
for question in questions:
    prediction = pipe.run(
        query=question,
        params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
    )
    # Keep only the top-ranked answer for each question.
    answers.append(prediction["answers"][0].to_dict()["answer"])


Inferencing Samples: 100%|██████████| 1/1 [00:22<00:00, 22.31s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:20<00:00, 20.37s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:20<00:00, 20.08s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:23<00:00, 23.43s/ Batches]


In [34]:
pprint(answers)

['Scrooge', 'More than eighteen hundred', 'Bob', 'France']


And there we have it: a simple question and answer pipeline.

Each question took more than 20 seconds to answer. I tried to build an environment that could run the pipeline in AWS with an API around it, but did not succeed.

With more time, I would tune the parameters further and perhaps reduce the number of documents, or split them into smaller passages, to see if I could bring the execution time down. Unfortunately, *I* ran out of time as well.
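As a starting point for that experiment, here is a sketch of splitting a text into smaller, overlapping passages by word count. It is plain Python rather than Haystack's own preprocessing (the indexing log above shows the pipeline already split the 37 files into 28,111 documents), and the function name `split_into_passages` and the `max_words` and `overlap` values are assumptions for illustration.

```python
def split_into_passages(text, max_words=100, overlap=20):
    """Split `text` into passages of at most `max_words` words.

    Consecutive passages share `overlap` words, so an answer that straddles
    a boundary still appears whole in at least one passage.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last passage already reaches the end of the text
    return passages


# A synthetic 250-word text: w0 w1 ... w249.
text = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(text, max_words=100, overlap=20)
print(len(passages), [len(p.split()) for p in passages])
```

Shorter passages give the Reader less text to scan per candidate, at the cost of more documents for the Retriever to rank. Whether that trade reduces total latency here is something I have not measured.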