# Question-Answering

Question Answering (QA) is a subfield of Natural Language Processing (NLP) that automates the task of answering questions posed in natural language. A QA system processes the input question, retrieves relevant information from a database or corpus, and presents a concise answer to the user. QA systems can be broadly classified into two types: open-domain and closed-domain. Open-domain QA systems are designed to answer any type of question, without being restricted to a specific topic. Closed-domain QA systems, on the other hand, answer questions within a specific domain or category of knowledge, such as medicine, law, or geography.

## Extractive QA

Extractive QA systems work by identifying and extracting a relevant answer from a large corpus of text. The answer is usually a span of text that directly answers the question. The system uses techniques such as information retrieval, named entity recognition, and co-reference resolution to extract the most relevant answer. Extractive QA systems are designed to work well when the answer is present in the text and can be accurately extracted.
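
To make the idea of "extracting a span" concrete, here is a minimal sketch of how a span could be chosen once a model has scored every token as a possible answer start and answer end. The scores below are made up for illustration, and `best_span` is a hypothetical helper, not a function from any library; real models and pipelines add further steps on top of this.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    max_answer_len tokens. The span score is the sum of both scores.
    """
    best = (0, 0, float("-inf"))
    for i, start_score in enumerate(start_scores):
        # Only consider end positions at or after the start position.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


tokens = ["the", "maximum", "number", "of", "worker", "nodes", "is", "180"]
# Hypothetical per-token scores, standing in for a QA model's output.
start_scores = [0.1, 0.2, 0.0, 0.0, 0.3, 0.1, 0.2, 4.0]
end_scores = [0.0, 0.1, 0.1, 0.0, 0.2, 0.5, 0.1, 3.5]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer, score)
```

The answer is always a contiguous slice of the input tokens, which is why an extractive system can only succeed when the answer literally appears in the context.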

Let's start by creating a corpus of text from the PDF file.

In [1]:
#Importing Packages

import transformers
from PyPDF2 import PdfReader
import pandas as pd
from transformers import pipeline
import re

In [2]:
# Open the PDF file; the context manager closes it automatically
with open("RH-1-productfaq.pdf", "rb") as pdf_file:
    pdf_reader = PdfReader(pdf_file)

    # Loop through each page and append its extracted text
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()

In [3]:
# Preprocess the context
text = re.sub(r'[^\w\s]', '', text) # Remove punctuation (\s already matches \n)
text = text.lower() # Convert to lowercase

In [4]:
text[:1000]

'home products red hat openshift re   \nred hat openshift\nservice on aws\nfrequently asked\nquestions\nget answers to common questions about red\nhat openshift service on aws rosa\xa0 learn\nhow to quickly build deploy and manage\nkubernetes applications on the industrys most\ncomprehensive application platform in aws cloud\ngeneral questions\nwhat does red hat openshift service on aws include\neach red hat openshift service on aws cluster comes with a fullymanaged\ncontrol plane master nodes and application nodes installation management\nmaintenance and upgrades are\xa0 monitored by red hat site reliability engineers\nsre with joint red hat and amazon support\xa0 cluster services such as logging\nmetrics monitoring are available as well\nhow is red hat openshift service on aws different from red hat\nopenshift container platform\nred hat openshift service on aws delivers a turnkey application platform\noptimized for performance scalability and security red hat openshift service\non a
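
The preprocessing in `In [3]` also strips hyphens and apostrophes, which is why the output above contains tokens such as `fullymanaged` and `industrys`. Because an extractive model returns spans verbatim, a gentler cleanup may be worth trying. The sketch below compares the two approaches on a made-up sample string; `normalize_whitespace` is a hypothetical helper and this is only one possible alternative, not the notebook's method.

```python
import re


def strip_punctuation(raw: str) -> str:
    """Same cleanup as In [3]: drop punctuation, then lowercase."""
    return re.sub(r"[^\w\s]", "", raw).lower()


def normalize_whitespace(raw: str) -> str:
    """Keep punctuation and case; collapse whitespace runs to one space."""
    # Replace non-breaking spaces explicitly, then collapse all whitespace.
    return re.sub(r"\s+", " ", raw.replace("\xa0", " ")).strip()


# Made-up sample that mimics the raw PDF text (line break, non-breaking space).
sample = "fully-managed\ncontrol plane.\xa0 Cluster   services"

print(repr(strip_punctuation(sample)))
print(repr(normalize_whitespace(sample)))
```

Keeping punctuation preserves sentence boundaries, which may help the model separate a question heading from the answer that follows it.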

Now we will use the extracted text to answer some questions.

In [5]:
from transformers import pipeline

model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'

quan_pipeline = pipeline(tokenizer=model_name, model=model_name, task='question-answering')

In [6]:
answer=quan_pipeline(question="what is the purpose of this document?",
             context=text)

In [7]:
answer

{'score': 0.11516688019037247,
 'start': 21770,
 'end': 21797,
 'answer': 'subscribe to our newsletter'}

In [8]:
answer=quan_pipeline(question="Is Red Hat OpenShift Service on AWS available for purchase in all countries?",
             context=text)

In [9]:
answer

{'score': 0.2048490196466446,
 'start': 3796,
 'end': 3874,
 'answer': 'available for purchase in all countries where\naws is commercially availablered'}

In [10]:
answer=quan_pipeline(question="What is the maximum number of worker node that the cluster can support?",
             context=text)

In [11]:
answer

{'score': 0.8800130486488342, 'start': 21078, 'end': 21081, 'answer': '180'}

In [12]:
answer=quan_pipeline(question="What is ROSA",
             context=text)

In [13]:
answer

{'score': 0.3828146457672119,
 'start': 11356,
 'end': 11415,
 'answer': 'an opinionated installation of openshift container platform'}
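
The `score` field varies a lot across the queries above (roughly 0.12 to 0.88), and the lowest-scoring answer is clearly wrong. One simple way to act on this is to reject answers below a cutoff. The sketch below uses two of the results shown above; `accept_answer` is a hypothetical helper and the 0.5 threshold is an arbitrary assumption that would need tuning on real data.

```python
def accept_answer(result, min_score=0.5):
    """Return the answer text, or None when the score is below min_score."""
    if result["score"] >= min_score:
        return result["answer"]
    return None


# Two results copied from the outputs above (positions omitted).
results = [
    {"score": 0.11516688019037247, "answer": "subscribe to our newsletter"},
    {"score": 0.8800130486488342, "answer": "180"},
]

accepted = [accept_answer(r) for r in results]
print(accepted)
```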

## Abstractive QA

Abstractive QA systems, on the other hand, attempt to understand the meaning of a question and generate an answer that summarizes or rephrases the information present in the text. Unlike extractive QA systems, abstractive QA systems do not simply extract an answer, but generate a new answer that may not be present in the text. This approach requires a more sophisticated understanding of language and often uses techniques such as summarization, text generation, and natural language generation. Abstractive QA systems are designed to work well when the answer is not explicitly stated in the text but can be inferred from it.



In this case, contexts are used as inputs (alongside the question) to a generative sequence-to-sequence (seq2seq) model. The model uses the question and context to generate the answer. Large transformer models store ‘representations’ of knowledge in their parameters. By passing relevant contexts and questions into the model, we hope that the model will use the context alongside its ‘stored knowledge’ to answer more abstract questions.
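
Since the question and context are passed to the model as a single string, it helps to build that string in one place. The sketch below follows the `question: ... context: ...` layout used in the cells that follow; `build_prompt` is a hypothetical helper, and normalizing the question mark and whitespace is an assumption about what makes a cleaner input, not a requirement of the model.

```python
def build_prompt(question: str, context: str) -> str:
    """Combine a question and a context into one seq2seq input string."""
    # Ensure the question ends with exactly one question mark.
    question = question.strip().rstrip("?").strip() + "?"
    # Collapse line breaks and repeated spaces in the context.
    context = " ".join(context.split())
    return f"question: {question} context: {context}"


prompt = build_prompt("What is ROSA", "red hat openshift\nservice on  aws")
print(prompt)
```

A helper like this avoids a doubled question mark when the query already ends with one.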

The seq2seq model is commonly BART-based or T5-based. In our case, we initialize a seq2seq pipeline using a BART model fine-tuned for abstractive QA: yjernite/bart_eli5.

In [14]:
model_name = 'yjernite/bart_eli5'

quan_pipeline = pipeline(tokenizer=model_name, model=model_name, task='text2text-generation') #seq2seq

Here is the relevant passage from the original FAQ:

"""
 What is the maximum number of worker nodes that a cluster can
support?

The maximum number of worker nodes is 180 per ROSA cluster. See here for
limits and scalability considerations and more details on node counts.
"""

In [15]:
query="What is the maximum number of worker node that the cluster can support?"
context = text[-1000:] # Use the last 1000 characters of the document as context

In [16]:
generated_text = quan_pipeline(f"question: {query} ? context: {context}",
                               num_beams=4,
                               do_sample=True,
                               temperature=1.5,
                               max_length=100) #Increasing the length

generated_text = generated_text[0]['generated_text']

In [17]:
generated_text

' The maximum number of worker nodes that a cluster can support is 180. See here for limits and scalability considerations and more details on node counts. See about autoscaling nodes on a cluster in the documentation for more details. The maximum number of worker nodes is 180 per rosa cluster. See here for limits and scalability considerations and more details on node counts. The maximum number of worker nodes is 180 per rosa cluster See here forlimits and scalability considerations and more details on'
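
The generated text above repeats the same sentences several times before being cut off at `max_length`. One possible cleanup, sketched below, is to drop sentences that have already appeared. `drop_repeated_sentences` is a hypothetical helper, the sample is a shortened stand-in for the real output, and the naive split on sentence-ending punctuation is an assumption that will not handle every case (for example, abbreviations).

```python
import re


def drop_repeated_sentences(generated: str) -> str:
    """Keep only the first occurrence of each sentence, in order."""
    sentences = re.split(r"(?<=[.!?])\s+", generated.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        # Compare sentences case-insensitively, ignoring extra spaces.
        key = " ".join(sentence.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)


sample = (
    "The maximum number of worker nodes is 180 per rosa cluster. "
    "See here for limits. "
    "The maximum number of worker nodes is 180 per rosa cluster. "
    "See here for limits."
)
cleaned = drop_repeated_sentences(sample)
print(cleaned)
```

This only removes exact repeats after generation; it does not change what the model produces.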

Let's try another query.

Here is a sample Q&A taken from the FAQ document:
    
"""
How is Red Hat OpenShift Service on AWS different from Red Hat
OpenShift Container Platform?

Red Hat OpenShift Service on AWS delivers a turnkey application platform,
optimized for performance, scalability, and security. Red Hat OpenShift Service
on AWS is hosted on Amazon Web Services public cloud and jointly managed by
Red Hat and AWS. Some options and administrative functions may be restricted
or unavailable. A Red Hat OpenShift Container Platform subscription entitles you
to host and manage the software on your infrastructure of choice.
"""

In [18]:
query = 'How is Red Hat OpenShift Service on AWS different from Red Hat OpenShift Container Platform?'

context = text[:2000] # Use the first 2000 characters of the document as context

In [19]:
generated_text = quan_pipeline(f"question: {query} ? context: {context}",
                               num_beams=4,
                               do_sample=True,
                               temperature=1.5,
                               max_length=100) #Increasing the length

generated_text = generated_text[0]['generated_text']

In [20]:
generated_text

" Red Hat OpenShift Service on AWS is a full-managed, managed, managed service. There's a management team, there's people to monitor the cluster, there's someone to monitor the cluster, there's someone to review the cluster, etc. RHT OpenShift Cloud Platform is a full-managed service where you can build your own virtual machine on top of someone else's machine. There's nobody monitoring your cluster, nobody checking it. There's nobody to monitor your cluster, there"

Let's try another one.

In [26]:
query = 'What is the purpose of this document?'

context = text[:3000] # Use the first 3000 characters of the document as context

In [27]:
generated_text = quan_pipeline(f"question: {query} ? context: {context}",
                               num_beams=4,
                               do_sample=True,
                               temperature=1.5,
                               max_length=100) #Increasing the length

generated_text = generated_text[0]['generated_text']

In [28]:
generated_text

" It's a manual for Redhat OpenShift. It's useful if you're planning on using Redhat in a large enterprise environment. I've never heard of it before, but after reading this manual I'm wondering what purpose it's for. It seems like it's a good idea, but I've never had a need to use it."
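
In the cells above, the context was chosen by hand as a fixed slice of the document (`text[:2000]`, `text[:3000]`, or the last 1000 characters). A rough alternative is to pick the window that shares the most words with the query. The sketch below does this with plain word overlap on a made-up document; `select_context`, the window size, and the step are all assumptions for illustration, and a real system would more likely use a proper retriever.

```python
import re


def select_context(text: str, query: str, window: int = 200, step: int = 100) -> str:
    """Return the character window of text sharing the most words with query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    best_start, best_overlap = 0, -1
    # Slide a fixed-size window over the text and score each position.
    for start in range(0, max(len(text) - window, 0) + 1, step):
        chunk_words = set(re.findall(r"\w+", text[start:start + window].lower()))
        overlap = len(query_words & chunk_words)
        if overlap > best_overlap:
            best_start, best_overlap = start, overlap
    return text[best_start:best_start + window]


# Made-up document: filler text around one relevant sentence.
doc = (
    "general questions about pricing and billing. " * 5
    + "the maximum number of worker nodes is 180 per rosa cluster. "
    + "subscribe to our newsletter for updates. " * 5
)
context = select_context(
    doc, "What is the maximum number of worker nodes?", window=80, step=20
)
print(context)
```

The selected window can then be passed as `context` in place of a hand-picked slice.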

## Conclusion

In this notebook, we explored transformer models that could be used to build a QA pipeline. In the first case, we looked at extractive QA, where the answer to a given query is extracted from the provided context. Here the answers are short and sometimes do not make much sense.

In the second case, we built a generative QA pipeline using a pre-trained seq2seq model (yjernite/bart_eli5), which takes a query and a context as input. It generates text that can be close to the actual answer and reads more like a human-written sentence, although, as the outputs above show, it can also repeat itself or drift away from the source document.