# Title: DISFL-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

#### Group Member Names : Yug Chandak 200552956, Samuel-Duke Ezeofor 200562683


### INTRODUCTION:
*********************************************************************************************************************
#### AIM : To correctly interpret general questions that contain the disfluencies found in natural human speech.

*********************************************************************************************************************
#### GitHub Repo: https://github.com/google-research-datasets/disfl-qa

*********************************************************************************************************************
#### DESCRIPTION OF PAPER:
The BERT and T5 models use a dataset of questions to which disfluencies, not present in the original questions, have been added. From this data they learn to remove the disfluencies from the questions and to answer them.
*********************************************************************************************************************
#### PROBLEM STATEMENT :
* Try to replicate the results given in the paper on a different dataset with a different question-answering model.
* Introduce a different model to answer the questions from which the disfluencies have been removed.
* Choose a story dataset for the model to learn from and answer questions about (link: https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip).
*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
* Disfluencies do not get enough attention in NLP, even though they are ever-present in human conversation.
* This is mostly because of the lack of datasets containing disfluencies.
* This is a problem for voice input: users have to speak almost perfectly fluently every time, otherwise the language model misunderstands or misinterprets what they said.
*********************************************************************************************************************
#### SOLUTION:
* The research was done using a dataset that pairs each original question with a disfluent version of the same question.
* Two models, BERT and T5, were both trained to remove disfluencies from questions.


# Background
*********************************************************************************************************************
#### Disfluency removal

|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|
|Aditya G. et al. [1]|Understanding disfluencies in question answering|Human-annotated datasets: SQuAD-v1, SQuAD-v2, and DISFL-QA|DISFL-QA contains a diverse set of disfluencies rooted in context, in particular a large fraction of corrections and restarts, to which the two models (BERT and T5) are not robust|


# Methodology
*********************************************************************************************************************
* QA models used: BERT-QA and T5-QA.
* The authors evaluated the QA models' performance using the standard SQuAD-v2 evaluation script, which reports exact match (EM) and F1 scores.
* They reported only the F1 numbers because they observed similar trends in EM and F1 across their experiments.
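
To make the two metrics concrete, here is a minimal sketch of how SQuAD-style EM and token-level F1 are commonly computed for a single prediction and reference. It is our own simplified illustration, not the official evaluation script; the function names are ours, and the handling of multiple reference answers is left out.

```python
import string
import re
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match (the "no answer" case).
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Eddard", "eddard"))
print(f1_score("Lord Eddard Stark", "Eddard Stark"))
```

EM only rewards a perfect match after normalization, while F1 gives partial credit for overlapping tokens, which is why the two usually move together but F1 is the smoother signal.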



# Implementation
********************************************************************************************************************
#### Aiming to answer these research questions:
* Are state-of-the-art LM-based QA models robust to the introduction of disfluencies in the questions under a zero-shot setting?
* Can we use heuristically generated synthetic disfluencies to aid the training of QA models to handle disfluencies?
* Given a small amount of labeled data, can we recover performance by fine-tuning the QA models or training a disfluency correction model to pre-process the disfluent questions into fluent ones before inputting to the QA models?
* In the above setting, can we train a generative model to generate more disfluent training data?
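
To illustrate what "heuristically generated synthetic disfluencies" could look like, the sketch below adds a filler word or a repeated token to a fluent question. These two rules, the `FILLERS` list, and the function names are our own assumptions for illustration; they are not the heuristics used in the paper.

```python
import random

# Hypothetical filler inventory, chosen only for this example.
FILLERS = ["um", "uh", "er"]


def add_filler(question: str, rng: random.Random) -> str:
    """Insert one filler word at a random position after the first token."""
    tokens = question.split()
    if not tokens:
        return question
    position = rng.randint(1, max(1, len(tokens) - 1))
    return " ".join(tokens[:position] + [rng.choice(FILLERS)] + tokens[position:])


def add_repetition(question: str, rng: random.Random) -> str:
    """Repeat one randomly chosen token, e.g. 'Who is is the father ...'."""
    tokens = question.split()
    if not tokens:
        return question
    position = rng.randrange(len(tokens))
    return " ".join(tokens[: position + 1] + tokens[position:])


rng = random.Random(0)  # fixed seed so repeated runs give the same output
question = "Who is the father of Arya Stark?"
print(add_filler(question, rng))
print(add_repetition(question, rng))
```

Rules like these only produce surface-level noise. They do not produce the contextual corrections and restarts that the paper identifies as the hard cases.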


### Note: The paper only provided the datasets used, not a link to the code for the research.
> So the code from here on is our own contribution.

#### NECESSARY LIBRARIES

In [None]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab,inference]

Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 9.7 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2.1
Collecting farm-haystack[colab,inference]
  Obtaining dependency information for farm-haystack[colab,inference] from https://files.pythonhosted.org/packages/df/ef/485cd648ee02afafd5c014b609c214299507112c246b75303f91fd2c139f/farm_haystack-1.19.0-py3-none-any.whl.metadata
  Downloading farm_haystack-1.19.0-py3-none-any.whl.metadata (25 kB)
Collecting boilerpy3 (from farm-haystack[colab,inference])
  Downloading boilerpy3-1.0.6-py3-none-any.whl (22 kB)
Collecting canals==0.3.2 (from farm-haystack[colab,inference])
  Obtaining dependency information for canals==0.3.2 from https://files.pythonhosted.org/packages/b8/f6/6d2071a20400129a72390f0



### Enabling Telemetry
Knowing that this tutorial is being used helps the Haystack developers decide where to invest their efforts, but you can always opt out by commenting out the `tutorial_running(1)` line below. See the Haystack telemetry documentation for more details.

In [None]:
from haystack.telemetry import tutorial_running

tutorial_running(1)
# Set the logging level: WARNING globally, INFO for Haystack
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the DocumentStore

We'll start creating our question answering system by initializing a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions.

In [None]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


## Preparing Documents

1. Download the Game of Thrones Wikipedia articles (the archive used here unpacks to 183 text files).

In [None]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_your_first_question_answering_system"

fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
    output_dir=doc_dir,
)

INFO:haystack.utils.import_utils:Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip to 'data/build_your_first_question_answering_system'


True

2. Use `TextIndexingPipeline` to convert the files you just downloaded into Haystack Document objects and write them into the DocumentStore:

In [None]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.
Converting files: 100%|██████████| 183/183 [00:04<00:00, 41.79it/s]
Preprocessing: 100%|██████████| 183/183 [00:05<00:00, 32.42docs/s]
Updating BM25 representation...: 100%|██████████| 2359/2359 [00:00<00:00, 5415.65 docs/s]


{'documents': [<Document: {'content': '\n\n"\'\'\'The Last of the Starks\'\'\'" is the fourth episode of the eighth season of HBO\'s fantasy television series \'\'Game of Thrones\'\', and the 71st overall. It was written by David Benioff and D. B. Weiss, and directed by David Nutter. It aired on May 5, 2019.\n\n"The Last of the Starks" shows the aftermath of the battle against the Army of the Dead while setting the stage for the final confrontation, with Daenerys, Jon, and their remaining forces going towards King\'s Landing to confront Cersei and demand her surrender.\n\nThe episode received mixed reviews. Critics praised its return to the political intrigue of earlier \'\'Game of Thrones\'\' episodes, but criticized the episode\'s writing. It received a Primetime Emmy Award nomination for Outstanding Directing for a Drama Series and was picked by Emilia Clarke to support her nomination for Outstanding Lead Actress in a Drama Series.\n\n', 'content_type': 'text', 'score': None, 'meta'

## Initializing the Retriever

Our search system will use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only the ones relevant to the question. This tutorial uses the BM25 algorithm.

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
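
To give an intuition for what BM25 ranking does, here is a small self-contained sketch of the Okapi BM25 scoring formula over whitespace-tokenized documents. It is an illustration under our own assumptions (the `bm25_scores` name, the `k1` and `b` defaults, and the smoothed IDF variant are ours) and is not the code Haystack runs internally.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query string."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    # Average document length, used to normalize term frequency.
    avg_len = sum(len(doc) for doc in tokenized) / n_docs or 1.0
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "arya stark is the daughter of eddard stark",
    "the dothraki vocabulary was created for the series",
    "sansa is the sister of arya",
]
print(bm25_scores("dothraki vocabulary", docs))
```

Because BM25 matches on exact tokens, extra words in a disfluent question can pull in documents that are unrelated to what the user finally meant to ask.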

## Initializing the Reader

A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text.

In [None]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


## Creating the Retriever-Reader Pipeline

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

## Asking a Question

1. Use the pipeline `run()` method to ask a question. The `query` argument is where you type your question. You can also use the `top_k` parameter to set how many documents the Retriever returns and how many answers the Reader returns. For more on setting arguments and on choosing `top_k` values, see the Haystack documentation pages "Arguments" and "Choosing the Right top-k Values".

In [None]:
prediction = pipe.run(
    query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

Inferencing Samples: 100%|██████████| 1/1 [00:18<00:00, 18.71s/ Batches]


Here are some questions you could try out:
- Who is the father of Arya Stark?
- Who created the Dothraki vocabulary?
- Who is the sister of Sansa?

**Results**

2. Print out the answers the pipeline returned:




In [None]:
from haystack.utils import print_answers

print_answers(prediction, details="minimum")  ## Choose from `minimum`, `medium`, and `all`

'Query: Who is the father of Arya Stark?'
'Answers:'
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': 'k in the television series.\n'
                   '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's h"},
    {   'answer': 'Lord Eddard Stark',
        'context': 'rk daughters.\n'
                   '\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Ned',
        'context': ' girl disguised as a boy all along and is surprised to '
      

*********************************************************************************************************************
### Our Additions in the Project :
* Researched and used a QA model (`deepset/roberta-base-squad2`, loaded through Haystack's `FARMReader`) that reads stories supplied as documents and answers questions about them. The questions are assumed to have had their disfluencies removed beforehand.
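
As a rough illustration of the "disfluencies removed beforehand" step, the sketch below strips filler words and immediate word repetitions from a question before it is sent to the QA pipeline. This is a toy rule-based stand-in that we assume for illustration; it is not the trained BERT or T5 disfluency correction model from the paper, and the `FILLERS` set and function name are hypothetical.

```python
# Hypothetical filler list, chosen only for this example.
FILLERS = {"um", "uh", "er", "hmm"}
PUNCTUATION = ",.?!"


def strip_simple_disfluencies(question: str) -> str:
    """Drop filler words and immediately repeated words from a question."""
    cleaned = []
    for token in question.split():
        bare = token.lower().strip(PUNCTUATION)
        if bare in FILLERS:
            continue  # filler word such as "um"
        if cleaned and bare and bare == cleaned[-1].lower().strip(PUNCTUATION):
            continue  # same word as the previous kept token
        cleaned.append(token)
    return " ".join(cleaned)


disfluent = "Who um is is the father of uh Arya Stark?"
print(strip_simple_disfluencies(disfluent))
```

The cleaned string could then be passed as the `query` argument of `pipe.run(...)`. A rule like this cannot handle corrections or restarts (for example "the mother, no wait, the father"), and it would also collapse legitimate repeated words, so it only covers the simplest cases.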

### Conclusion and Future Direction :
*******************************************************************************************************************************
#### Learnings :
We learnt how to replicate a paper using its GitHub repo.

*******************************************************************************************************************************
#### Results Discussion :
QA models are applicable in many scenarios, and when asked questions without complex disfluencies they are able to produce accurate answers.

*******************************************************************************************************************************
#### Limitations :

* NLP research is not yet advanced enough to interpret natural human communication with all its disfluencies, especially those rooted in context, such as restarts and subject-matter corrections.


*******************************************************************************************************************************
#### Future Extension :
The field of NLP is developing fast, and we expect that dealing with complex disfluencies will get easier.

# References:

[1]: Aditya Gupta, Jiacheng Xu, Shyam Upadhyay, Diyi Yang, Manaal Faruqui (2021). DISFL-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering. arXiv:2106.04016v1 [cs.CL]