# Simple Question Answering (QA) with Haystack

The purpose of this notebook is to explore how [Haystack](https://haystack.deepset.ai) can be used as a framework for Question Answering (QA) systems.

A few different approaches to QA are tested below:

- [Extractive QA](#Extractive-QA)
- [Generative QA](#Generative-QA), using:
  - [Long-form Question Answering](#LFQA) using sequence-to-sequence (`Seq2Seq`) models
  - [OpenAI completions API](#OpenAI), i.e. GPT-3

In [1]:
# Questions we want to ask, to compare the answers each model gives

questions = [
    "Do I have to sign a contract with Red Hat in order to deploy a ROSA cluster?",
    "Is ROSA GDPR Compliant?",             # https://www.rosaworkshop.io/rosa/14-faq/#is-rosa-gdpr-compliant
    "How can I upgrade my ROSA cluster?",  # https://www.rosaworkshop.io/rosa/9-upgrade/
    "What is STS?",      # https://www.rosaworkshop.io/rosa/15-sts_explained/#what-is-aws-sts-security-token-service
    "How is ROSA related to Kubernetes?", # https://docs.openshift.com/rosa/rosa_architecture/rosa_architecture_sub/rosa-basic-architecture-concepts.html#rosa-kubernetes-concept_rosa-basic-architecture-concepts
    "How can I federate metrics?",    # https://mobb.ninja/docs/rosa/federated-metrics/
    "Is there any tool to help me troubleshoot my VPC connection?", # https://docs.openshift.com/rosa/rosa_cluster_admin/cloud_infrastructure_access/dedicated-aws-peering.html#dedicated-aws-vpc-verifying-troubleshooting
    "What time is it?",  # adversarial
]

### Preparation

In [2]:
# Imports
import boto3
import logging
import os
from dotenv import load_dotenv, find_dotenv
from glob import glob
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore, FAISSDocumentStore
from haystack.nodes import BM25Retriever, FARMReader, RAGenerator, DensePassageRetriever, Seq2SeqGenerator
from haystack.pipelines import ExtractiveQAPipeline, GenerativeQAPipeline, DocumentSearchPipeline, Pipeline
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
from haystack.nodes import OpenAIAnswerGenerator
from haystack.nodes import MarkdownConverter, PreProcessor
from haystack.utils import convert_files_to_docs, print_answers, print_documents

# Configure logging
logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.WARNING)

#### Input files

We use a dataset consisting of a series of Markdown files about ROSA.

In [3]:
# Load custom environment variables
## Create a local .env file with the required settings first
load_dotenv(find_dotenv())

api_key = os.getenv("OPENAI_API_KEY")  # Required by the OpenAI generator

# Obtain a copy of the current dataset

s3 = boto3.client('s3',
                  endpoint_url=os.getenv("S3_ENDPOINT_URL"),
                  aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
                  aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY"))
s3_bucket_name = os.getenv("S3_BUCKET_NAME")
s3_path = os.getenv("S3_PROJECT_KEY", "rosa-docs/")  # Limit which files we get

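# Download data from Ceph (accessed through its S3-compatible API)
data_dir = "../data/raw"
os.makedirs(data_dir, exist_ok=True)  # download_file fails if the target directory is missing
# NOTE: list_objects returns at most 1000 keys per call; use a paginator for larger buckets
for key in s3.list_objects(Bucket=s3_bucket_name)['Contents']:
    if key["Key"].startswith(s3_path):
        print("Downloading", key['Key'])
        # str.removeprefix requires Python 3.9+
        filename = os.path.join(data_dir, key["Key"].removeprefix(s3_path))
        s3.download_file(s3_bucket_name, key["Key"], filename)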

Downloading rosa-docs/adding_service_cluster.md
Downloading rosa-docs/applications.md
Downloading rosa-docs/authentication.md
Downloading rosa-docs/cicd.md
Downloading rosa-docs/logging.md
Downloading rosa-docs/networking.md
Downloading rosa-docs/ocm.md
Downloading rosa-docs/rosa_architecture.md
Downloading rosa-docs/rosa_backing_up_and_restoring_applications.md
Downloading rosa-docs/rosa_cli.md
Downloading rosa-docs/rosa_cluster_admin.md
Downloading rosa-docs/rosa_getting_started.md
Downloading rosa-docs/rosa_install_access_delete_clusters.md
Downloading rosa-docs/rosa_planning.md
Downloading rosa-docs/rosa_support.md
Downloading rosa-docs/serverless.md
Downloading rosa-docs/service_mesh.md
Downloading rosa-docs/storage.md
Downloading rosa-docs/upgrading.md
Downloading rosa-docs/welcome.md


In [4]:
# Where are the input text files
# Besides the dataset, we also add the samples available in this repo
local_samples = ["../data/external/rosaworkshop", "../data/external/rh-mobb"]
doc_dirs = [data_dir] + local_samples

# Which files will we consider
file_pattern = "*.md"
files_to_index = [file for doc_dir in doc_dirs for file in glob(os.path.join(doc_dir, file_pattern))]

print(f"There are {len(files_to_index)} files")

There are 74 files


## Extractive QA

Extractive QA is about extracting an answer to the question from a given context. A context is provided so that the model can refer to it and make predictions about where the answer is inside the passage.

This is a quick experiment based on the [Haystack simple tutorial for QA](https://haystack.deepset.ai/tutorials/01_basic_qa_pipeline). As in the Haystack tutorial, we use a base RoBERTa model fine-tuned on the SQuAD 2.0 dataset, [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2).
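
To make "predicting where the answer is inside the passage" more concrete, here is a minimal sketch of span selection. It assumes the reader has already produced one start score and one end score per token; the tokens and scores below are made up for illustration and are not output from `deepset/roberta-base-squad2`.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 10) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_score in enumerate(start_scores):
        # Only consider spans that end at or after `start` and are not too long
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Hypothetical tokens and scores, for illustration only
tokens = ["No", ".", "You", "do", "not", "need", "a", "contract"]
start_scores = [0.1, 0.0, 2.5, 0.3, 0.2, 0.1, 0.0, 0.4]
end_scores = [0.2, 0.1, 0.0, 0.1, 0.3, 0.5, 0.2, 2.8]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

The answer is always a contiguous slice of the context, which is why extractive answers cannot rephrase or combine passages.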


#### Pre-processing

The source files need some pre-processing, both to ingest them from Markdown and to adapt the passage size to what the models can handle.

As the purpose of this notebook is to explore what Haystack offers, we are using Haystack pre-processing tools.

**NOTE**: splitting the documents this way most likely breaks up the long step-by-step procedures present in several of the source files. In the future, this pre-processing should parse the Markdown properly and split at meaningful boundaries that preserve the document structure.
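
A minimal sketch of what a structure-aware split could look like: it cuts only at Markdown headings, so that a numbered procedure stays together with the heading that introduces it. The function name `split_by_heading` and the sample text are hypothetical, and this is not part of Haystack.

```python
import re
from typing import List

# A Markdown ATX heading: one to six '#' characters followed by whitespace
HEADING = re.compile(r"^#{1,6}\s")


def split_by_heading(markdown: str) -> List[str]:
    """Split a Markdown string into sections, each starting at a heading."""
    sections: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        # Track fenced code blocks so that '#' comments inside them are not
        # mistaken for headings
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if HEADING.match(line) and not in_fence and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that are empty after stripping
    return [s for s in sections if s]


# Hypothetical sample document
sample = "# Upgrade\n\n1. Step one\n2. Step two\n\n## Verify\n\nCheck the version.\n"
sections = split_by_heading(sample)
print(len(sections))
```

Sections produced this way can still exceed the model's input size, so a length-based split would remain necessary as a fallback.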


In [5]:
# Pre-process docs
# Ref: https://haystack.deepset.ai/tutorials/08_preprocessing#preprocessor
# Quote: File splitting can have a very significant impact on the system’s performance and is absolutely mandatory for Dense Passage Retrieval models.

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=500,
    split_respect_sentence_boundary=True,
)

# Converting from markdown
# Ref: https://docs.haystack.deepset.ai/docs/file_converters
converter = MarkdownConverter(
    remove_numeric_tables=True
)

**FIXME**:
I noticed a problem with the pre-processing above: it strips content it shouldn't. Example:

Original content (from rosa docs: rosa_planning.md):

```
The AWS IAM roles required to use OpenShift Cluster Manager are:

-   `ocm-role`

-   `user-role`

Whether you manage your clusters using the `rosa` CLI or[...]
```


Content that ends up in the data store:

```
sqlite> select * from document where id="10a6cc97b3d4ede49529cb87db7b637f";
10a6cc97b3d4ede49529cb87db7b637f|2023-02-23 20:18:24|2023-02-23 20:18:24|"The AWS IAM roles required to use OpenShift Cluster Manager are:\n\nWhether you manage your clusters using the   CLI or[...]
```

Notice how all the content within backquotes (like `rosa` or `user-role`) has been removed.
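
One possible workaround, sketched under the assumption that the loss happens because inline code spans are dropped somewhere during conversion (the actual cause has not been confirmed here): unwrap the backticks in the raw Markdown before handing it to the converter. The helper `unwrap_inline_code` is hypothetical.

```python
import re

# Matches an inline code span delimited by single backticks on one line
INLINE_CODE = re.compile(r"`([^`\n]+)`")


def unwrap_inline_code(markdown: str) -> str:
    """Replace `code` with code so the text survives later clean-up steps."""
    return INLINE_CODE.sub(r"\1", markdown)


# Fragment taken from the rosa_planning.md example above
original = "-   `ocm-role`\n\n-   `user-role`\n\nusing the `rosa` CLI"
print(unwrap_inline_code(original))
```

This would have to run on the file contents before conversion, which means writing cleaned copies of the files or using a custom converter step.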

In [6]:
# Initialize an in-memory document store
document_store = InMemoryDocumentStore(use_bm25=True)

# Index the documents into the document store
indexing_pipeline = TextIndexingPipeline(document_store=document_store, preprocessor=preprocessor, text_converter=converter)
documents = indexing_pipeline.run_batch(file_paths=files_to_index)

print(f"There are {len(documents['documents'])} documents from {len(documents['file_paths'])} files")

# Initialize the retriever, reader, and retriever-reader pipeline
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
pipe = ExtractiveQAPipeline(reader, retriever)


There are 705 documents from 74 files


#### Answering questions

Here we ask the questions defined at the top of the notebook and inspect the answers obtained.

In [7]:
for question in questions:
    prediction = pipe.run(
        query=question,
        params={
            "Retriever": {"top_k": 10},
            "Reader": {"top_k": 3}
        }
    )
    # Show a simplified list of answers
    print_answers(
        prediction,
        details="minimum" ## Choose from `minimum`, `medium`, and `all`
    )




Query: Do I have to sign a contract with Red Hat in order to deploy a ROSA cluster?
Answers:
[   {   'answer': 'You do not need to have a contract with Red Hat',
        'context': 'Do I need to sign/have a contract with Red Hat?\n'
                   'No. You do not need to have a contract with Red Hat to use '
                   'ROSA. You will need a Red Hat account for u'},
    {   'answer': 'If this is the first time you are deploying ROSA in this '
                  'account and have not yet created the account roles',
        'context': 'sociated AWS account.\n'
                   'If this is the first time you are deploying ROSA in this '
                   'account and have not yet created the account roles, then '
                   'create the acc'},
    {   'answer': 'You will need a Red Hat account',
        'context': 'ou do not need to have a contract with Red Hat to use '
                   'ROSA. You will need a Red Hat account for use on '
                   'console


Query: Is ROSA GDPR Compliant?
Answers:
[   {   'answer': 'Yes:',
        'context': 'r own backup policies for applications and data.\n'
                   'Is ROSA GDPR Compliant?\n'
                   'Yes: https://www.redhat.com/en/gdpr\n'
                   'Does the ROSA CLI accept Multi-region KMS'},
    {   'answer': '2',
        'context': 'rker nodes that a ROSA cluster can have?\n'
                   'For a ROSA cluster the minimum is 2 worker nodes for '
                   'single AZ and 3 for multiple AZ.\n'
                   'Where can I find the pr'},
    {   'answer': 'both methods are currently enabled',
        'context': 'account in order to create and operate the cluster. While '
                   'both methods are currently enabled, the “ROSA with STS” '
                   'method is the preferred and recommen'}]



Query: How can I upgrade my ROSA cluster?
Answers:
[   {   'answer': 'by using the   CLI',
        'context': 'S cluster that uses the AWS Security Token Service (STS) '
                   'manually by using the   CLI.\n'
                   'This method schedules the cluster for an immediate '
                   'upgrade, if a'},
    {   'answer': '180',
        'context': 'r nodes that a cluster can support?\n'
                   'The maximum number of worker nodes is 180 per ROSA '
                   'cluster.  See here for limits and scalability '
                   'considerations an'},
    {   'answer': '2',
        'context': 'rker nodes that a ROSA cluster can have?\n'
                   'For a ROSA cluster the minimum is 2 worker nodes for '
                   'single AZ and 3 for multiple AZ.\n'
                   'Where can I find the pr'}]



Query: What is STS?
Answers:
[   {   'answer': 'AWS Security Token Service',
        'context': 'Red Hat OpenShift Service on AWS (ROSA) cluster that uses '
                   'the AWS Security Token Service (STS).\n'
                   'Account\n'
                   '\n'
                   'You must ensure that the AWS limits are suffi'},
    {   'answer': 'Security Token Service',
        'context': 'ethod is the preferred and recommended option.\n'
                   'What is AWS STS (Security Token Service)?\n'
                   'As stated in the AWS documentation AWS STS “enables you to '
                   're'},
    {   'answer': 'AWS Security Token Service',
        'context': ' into a customer’s existing Amazon Web Service (AWS) '
                   'account.\n'
                   'AWS Security Token Service (STS) is the recommended '
                   'credential mode for installing and i'}]



Query: How is ROSA related to Kubernetes?
Answers:
[   {   'answer': 'affects all Kubernetes distributions',
        'context': 'data store (etcd). This design is not unique to ROSA and '
                   'affects all Kubernetes distributions. Anyone with API '
                   'access can retrieve or modify a Secret,'},
    {   'answer': '2',
        'context': 'rker nodes that a ROSA cluster can have?\n'
                   'For a ROSA cluster the minimum is 2 worker nodes for '
                   'single AZ and 3 for multiple AZ.\n'
                   'Where can I find the pr'},
    {   'answer': 'you must first containerize your app by creating a '
                  'container image that you store in a container registry',
        'context': 'in Kubernetes on ROSA, you must first containerize your '
                   'app by creating a container image that you store in a '
                   'container registry.\n'
                   'Image\n'
                   'A container i


Query: How can I federate metrics?
Answers:
[   {   'answer': 'HTTP code',
        'context': 'trics for Knative Eventing components.\n'
                   'By aggregating the metrics from HTTP code, events can be '
                   'separated into two categories; successful events (2xx)'},
    {   'answer': 'aggregating the metrics from HTTP code',
        'context': 'e following metrics for Knative Eventing components.\n'
                   'By aggregating the metrics from HTTP code, events can be '
                   'separated into two categories; successfu'},
    {   'answer': 'using the MOBB Helm Chart to deploy the necessary agents to '
                  'federate the metrics into AWS Prometheus',
        'context': 'de will walk you through using the MOBB Helm Chart to '
                   'deploy the necessary agents to federate the metrics into '
                   'AWS Prometheus and then use Grafana to '}]



Query: Is there any tool to help me troubleshoot my VPC connection?
Answers:
[   {   'answer': 'verification procedure',
        'context': 'k Save.\n'
                   '\n'
                   'The VPC peering connection is now complete. Follow the '
                   'verification procedure to ensure connectivity across the '
                   'peering connection is working'},
    {   'answer': 'AWS documentation',
        'context': ' public and private subnets and AWS Site-to-Site VPN '
                   'access in the AWS documentation.\n'
                   '\n'
                   'Policies and service definition\n'
                   'About availability for Red Hat '},
    {   'answer': 'Route Propagation',
        'context': '\n'
                   '\n'
                   'After the VPN connection has been established, be sure to '
                   'set up Route Propagation or the VPN may not function as '
                   'expected.\n'
                   'Note the VPC


Query: What time is it?
Answers:
[   {   'answer': '2',
        'context': 'rker nodes that a ROSA cluster can have?\n'
                   'For a ROSA cluster the minimum is 2 worker nodes for '
                   'single AZ and 3 for multiple AZ.\n'
                   'Where can I find the pr'},
    {   'answer': 'an hour',
        'context': 'esources in your AWS account. After these credentials '
                   'expire (typically an hour after being requested), they are '
                   'no longer recognized by AWS and they '},
    {   'answer': 'an hour after being requested',
        'context': ' your AWS account. After these credentials expire '
                   '(typically an hour after being requested), they are no '
                   'longer recognized by AWS and they no longer h'}]


### Observations about extractive QA

Some comments about the results:

- All the answers are very short. Some questions call for relatively long procedures with multiple steps, which are detailed in the documents; other questions do have short answers.
- Not all answers make sense: for example, the adversarial question "What time is it?" still gets answers such as '2' and 'an hour'.

#### Next steps / TODO

- Add metadata to the documents pointing to the original file, so that each answer can eventually point to the location (URL) of its source
- Proper testing and evaluation
- Before that, however, it would be interesting to explore whether there are ways to influence the length / extent of the answer; this matters when the question is about a procedure with clear steps to follow.
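
As a starting point for the metadata item, here is a minimal sketch that derives a source record from a file path. The directory-to-URL mapping, the file name `14-faq.md`, and the assumption that the page slug equals the file name without its extension are all hypothetical and would need checking against the real sites.

```python
import os
from typing import Dict

# Hypothetical mapping from local directories to the public site they mirror
BASE_URLS: Dict[str, str] = {
    "../data/external/rosaworkshop": "https://www.rosaworkshop.io/rosa/",
    "../data/external/rh-mobb": "https://mobb.ninja/docs/rosa/",
}


def source_meta(file_path: str) -> Dict[str, str]:
    """Build a metadata dict recording where a document came from."""
    directory, file_name = os.path.split(file_path)
    meta = {"name": file_name, "file_path": file_path}
    base_url = BASE_URLS.get(directory)
    if base_url:
        # Assumes the page slug equals the file name without its extension
        slug = os.path.splitext(file_name)[0]
        meta["url"] = f"{base_url}{slug}/"
    return meta


# Hypothetical file name, for illustration only
meta = source_meta("../data/external/rosaworkshop/14-faq.md")
print(meta)
```

Haystack `Document` objects carry a `meta` dictionary, so a record like this could travel with each passage and be shown next to the answer.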

## Generative QA

Generative QA is about creating novel text during the answering process. While extractive QA highlights the span of text from the context that answers a query, generative QA creates new text.

The general approach here is to start by obtaining vector embeddings that represent the domain-specific knowledge (context) and storing them in a vector database. When the time comes to generate an answer to a query, the process becomes:

1. calculate the embedding for the question
2. query the embeddings vector store to obtain a (set of) relevant documents
3. **retrieve** the relevant documents (the ones found in the previous step) and pass them as context to the model, along with the question
4. the model **generates** the answer
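
The first three steps can be sketched without any model. In this toy version a bag-of-words count vector stands in for a real dense encoder, a plain dictionary stands in for the vector store, and the prompt template is hypothetical; the passages are adapted from the reader outputs earlier in this notebook. Step 4 (the model call) is left out.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, store: Dict[str, Counter], top_k: int = 2) -> List[str]:
    """Steps 1-3: embed the question, rank stored passages, return the best ones."""
    q = embed(question)
    ranked = sorted(store, key=lambda doc: cosine(q, store[doc]), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Combine retrieved passages and the question (hypothetical template)."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


passages = [
    "You do not need to have a contract with Red Hat to use ROSA.",
    "For a ROSA cluster the minimum is 2 worker nodes for single AZ.",
    "The ROSA with STS method is the preferred and recommended option.",
]
store = {p: embed(p) for p in passages}  # indexing: embed every passage once
question = "Do I need a contract with Red Hat?"
top = retrieve(question, store, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

A dense retriever replaces `embed` with a neural encoder, and the vector store replaces the exhaustive `sorted` call with an index built for fast similarity search.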

The examples below use [Faiss](https://faiss.ai/) as the vector store.

References:
- https://docs.haystack.deepset.ai/docs/answer_generator

### LFQA

Long-Form Question Answering (LFQA) is a variety of the generative question answering task. LFQA systems query large document stores for relevant information and then use this information to generate accurate, multi-sentence answers.

In an extractive question answering system, the retrieved documents related to the query (context passages) act directly as source tokens for extracted answers. In an LFQA system, context passages provide the context the system uses to generate original, abstractive, long-form answers.

Here we use a `Seq2SeqGenerator` with the [vblagoje/bart_lfqa](https://huggingface.co/vblagoje/bart_lfqa) model.
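
To illustrate how context passages reach a sequence-to-sequence model, the sketch below concatenates a question and its passages into a single input string. The `question: ... context: <P> ...` template follows what the model card for `vblagoje/bart_lfqa` appears to describe; treat the exact format as an assumption, and note that the Haystack generator handles this step internally. The passage strings are placeholders.

```python
from typing import List


def build_lfqa_input(question: str, passages: List[str]) -> str:
    """Concatenate a question and its supporting passages into one input string."""
    # Each passage is prefixed with a "<P>" marker (assumed separator)
    conditioned = "<P> " + " <P> ".join(passages)
    return f"question: {question} context: {conditioned}"


# Placeholder passages, for illustration only
text = build_lfqa_input("What is STS?", ["passage one", "passage two"])
print(text)
```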

References:
- https://haystack.deepset.ai/tutorials/12_lfqa
- https://yjernite.github.io/lfqa.html
- https://towardsdatascience.com/long-form-qa-beyond-eli5-an-updated-dataset-and-approach-319cb841aabb

In [9]:
faiss_index = "faiss_lfqa.index"
faiss_config = "faiss_lfqa.cfg"

if os.path.isfile(faiss_config):
    document_store = FAISSDocumentStore.load(index_path=faiss_index, config_path=faiss_config)
    # Delete existing documents in documents store
    document_store.delete_documents()
else:
    document_store = FAISSDocumentStore(sql_url="sqlite:///faiss_rag.db", embedding_dim=128)

# Write documents to document store
document_store.write_documents(documents["documents"])


In [10]:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="vblagoje/dpr-question_encoder-single-lfqa-wiki",
    passage_embedding_model="vblagoje/dpr-ctx_encoder-single-lfqa-wiki",
)

document_store.update_embeddings(retriever)
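# Save the config under the same path that is checked when loading the store
document_store.save(index_path=faiss_index, config_path=faiss_config)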

generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.



Before jumping into question answering, let's do a quick check of the retriever only.

The retriever selects the set of documents most relevant to the question being asked, based on the similarity between the embedding of the question and the embeddings stored in the vector store.

Here we query the retriever using a question (query) and observe which passages (documents) it selects:

In [11]:
p_retrieval = DocumentSearchPipeline(retriever)
res = p_retrieval.run(query="Are AWS IAM roles relevant to ROSA?", params={"Retriever": {"top_k": 10}})
print_documents(res, max_text_len=512)


Query: Are AWS IAM roles relevant to ROSA?

{   'content': 'Understanding ROSA\n'
               'Learn about Red Hat OpenShift Service on AWS (ROSA), '
               'interacting with ROSA using Red Hat OpenShift Cluster Manager '
               'and command-line interface (CLI) tools, consumption '
               'experience, and integration with Amazon Web Services (AWS) '
               'services.\n'
               'About ROSA\n'
               'ROSA is a fully-managed, turnkey application platform that '
               'allows you to focus on delivering value to your customers by '
               'building and deploying applications. Red Hat and AWS Site '
               'reliability engineering (SRE) experts manage the underlying '
               'platform...',
    'name': None}

{   'content': 'AWS Shared VPCs are not currently supported for ROSA '
               'installations.\n'
               '\n'
               'You have completed the AWS prerequisites for ROSA with STS.\

#### Question answering

Now let's answer the same set of questions as before.

In [12]:
pipe = GenerativeQAPipeline(generator=generator, retriever=retriever)
for question in questions:
    res = pipe.run(query=question, params={"Retriever": {"top_k": 3}})
    print_answers(res, details="medium")


Query: Do I have to sign a contract with Red Hat in order to deploy a ROSA cluster?
Answers:
[   {   'answer': 'Yes, you have to sign a contract with Red Hat in order to '
                  'deploy a ROSA cluster.'}]

Query: Is ROSA GDPR Compliant?
Answers:
[   {   'answer': 'ROSA is not GDPR compliant. It is a cloud service, which '
                  'means that it is not governed by the GDPR. However, it is '
                  'subject to the same laws as any other cloud service.'}]

Query: How can I upgrade my ROSA cluster?
Answers:
[   {   'answer': 'ROSA with STS requires you to install the latest version of '
                  "Red Hat OpenShift. If you don't want to do that, you can "
                  'upgrade your cluster to a newer version of OpenShift using '
                  'the AWS IAM Console.'}]

Query: What is STS?
Answers:
[   {   'answer': 'ROSA is a service that needs to manage infrastructure '
                  'resources in your AWS account. In order to manage t

### OpenAI

The OpenAI Answer generator uses OpenAI's completion API to generate answers.

The pipeline used here is similar to the previous one in that it uses the same dense passage retrieval approach to prepare context that is submitted to the completion API together with the question.

In [13]:
# Initialize the OpenAIAnswerGenerator
generator = OpenAIAnswerGenerator(
    api_key=api_key,
    model="text-davinci-003",
    max_tokens=150,
    presence_penalty=0.1,
    frequency_penalty=0.1,
    top_k=2,
    temperature=0.9
)

In [14]:
pipe = GenerativeQAPipeline(generator=generator, retriever=retriever) # Same retriever as before

In [15]:
for question in questions:
    res = pipe.run(query=question, params={"Retriever": {"top_k": 2}})
    print_answers(res, details="medium")


Query: Do I have to sign a contract with Red Hat in order to deploy a ROSA cluster?
Answers:
[   {   'answer': ' No, you do not have to sign a contract with Red Hat in '
                  'order to deploy a ROSA cluster. However, you must meet the '
                  'customer requirements listed in the documentation and '
                  'ensure that your AWS account is set up according to the Red '
                  'Hat prerequisites.'},
    {   'answer': ' No, you do not have to sign a contract with Red Hat in '
                  'order to deploy a ROSA cluster.'}]

Query: Is ROSA GDPR Compliant?
Answers:
[   {   'answer': ' ROSA does not have GDPR compliance built into the '
                  'platform. However, users may be able to configure their own '
                  'solutions to achieve GDPR compliance within their '
                  'environment.'},
    {   'answer': ' ROSA does not have GDPR compliance features built into the '
                  'platform. However, cu

#### Observations about OpenAI based answers

- The generated text looks more elaborate / friendly than that of the other approaches
- The adversarial example is correctly handled, unlike with the other approaches
- Still, some answers are just wrong (e.g. GDPR)
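
A crude way to turn the "some answers are just wrong" observation into a number, for yes/no questions only: compare the opening word of each generated answer with the polarity stated in the FAQ passages that the extractive reader surfaced. The answers below are the first sentences of the outputs shown in this notebook; the helper names are hypothetical, and answers that do not open with "yes" or "no" are counted as misses.

```python
from typing import Dict, Optional

CONTRACT_Q = "Do I have to sign a contract with Red Hat in order to deploy a ROSA cluster?"
GDPR_Q = "Is ROSA GDPR Compliant?"


def leading_polarity(answer: str) -> Optional[str]:
    """Return 'yes' or 'no' if the answer opens with that word, else None."""
    words = answer.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!")
    return first if first in ("yes", "no") else None


# Expected polarity, taken from the FAQ passages found by the extractive reader
expected: Dict[str, str] = {CONTRACT_Q: "no", GDPR_Q: "yes"}

# First sentence of each generated answer, copied from the outputs above
generated: Dict[str, Dict[str, str]] = {
    "LFQA": {
        CONTRACT_Q: "Yes, you have to sign a contract with Red Hat in order to deploy a ROSA cluster.",
        GDPR_Q: "ROSA is not GDPR compliant.",
    },
    "OpenAI": {
        CONTRACT_Q: " No, you do not have to sign a contract with Red Hat in order to deploy a ROSA cluster.",
        GDPR_Q: " ROSA does not have GDPR compliance built into the platform.",
    },
}


def count_matches(answers: Dict[str, str]) -> int:
    """Count answers whose leading polarity agrees with the expected one."""
    return sum(leading_polarity(a) == expected[q] for q, a in answers.items())


for name, answers in generated.items():
    print(name, count_matches(answers), "of", len(answers))
```

This heuristic is far from a proper evaluation, but it is enough to flag the GDPR answers automatically instead of by reading them.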

## Next steps / TODO

- FIXME: proper markdown parsing that:
  - does not remove content
  - splits documents while preserving meaningful structure
- Identify the context (i.e. which sources were used to obtain the answers) in all the answers. Include links to sources
- It would be interesting to be able to observe the API calls that the pipeline makes, i.e. which prompts are being used