## Question & Answer Pair Extraction
In this notebook, I extract the test questions and their corresponding answers from the corpus by first running a question generation model on the corpus, and then manually editing, answering and ranking the questions. 

In [1]:
!pip install transformers llama-index protobuf SentencePiece llama-index-readers-pdf-table



### Load in the data using Simple Directory Reader

In [1]:
from llama_index.core import SimpleDirectoryReader

# loading in data
reader = SimpleDirectoryReader(input_dir='data')
data = reader.load_data()

Extract from document object and put the contents of each page in a list, clean up with regex

In [2]:
import regex as re

input_text = []
for i in range(16,184):
    page = data[i].text
    page = re.sub(r'\s+', ' ', page)
    input_text.append(page)

### Question generation part
Import the question generation model and run it on the input text. We divide each page into five and generate a question from each part; resulting in 5 questions per page.

In [12]:
from transformers import T5Config, T5ForConditionalGeneration, T5Tokenizer

model_name = "allenai/t5-small-squad2-question-generation"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    res = model.generate(input_ids, **generator_args)
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    print(output)
    return output


all_questions = []

for i in range(len(input_text)):
  n_parts = 5
  length = len(input_text[i])
  page_text = input_text[i]
  part_size = length // n_parts  # Approximate size of each part

  # Create slices
  parts = [page_text[j:j+part_size] for j in range(0, length, part_size)]

  # Ensure exactly 5 parts (combine excess characters into the last part)
  if len(parts) > n_parts:
      parts = parts[:n_parts-1] + [''.join(parts[n_parts-1:])]

  page_questions = [run_model(part) for part in parts]
  all_questions.append(page_questions)

tokenizer_config.json:   0%|          | 0.00/1.86k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]



['What is the annual WASHNORM survey?']
['What was the first WASHNORM conducted in 2018?']
['What was the second edition of the WASHNORM survey conducted between August 2019 and February 2020']
['What was added to the household module on COVID 19?']
['What is the basis for a systematic monitoring of prog ress towards attainment of key']
['What is the basis for developing state-wide WASH investment plans?']
['What is the name of the survey report presented in?']
['What is the detailed methodology of the WASHNORM 2021 survey?']
['What was the data collection conducted in the third and fourth quarter of 2021?']
['How many EAs were covered in the NISH sampling frame?']
['How many households were canvassed in each of the 34 states and FCT?']
['What was the name of the sampling frames developed?']
['How many primary and secondary schools were listed across the country?']
['How many households were covered out of the targeted 7,400?']
['What is the actual number of samples assessed for the 20

Pickle them so we don't have to regenerate every time

In [13]:
import pickle

# Save documents as a pickle file
output_path = "all_questions.pkl"
with open(output_path, "wb") as f:
    pickle.dump(all_questions, f)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Load them back in and write them into a .txt file

In [1]:
with open('all_questions.pkl', 'rb') as f:
        questions = pickle.load(f)

In [15]:
import regex as re
output_file = "all_questions.txt"
clean_questions = ''
for question in questions:
    for quest in question:
        clean_questions += str(quest) + '\n'

clean_questions = re.sub(r'[\[\]]','',clean_questions)

with open(output_file,'w') as f:
    f.write(clean_questions)


In [21]:
output_file = "all_small_dataset_questions.txt"
small_clean_questions = ''
for question in questions[:13]:
    for quest in question:
        small_clean_questions += str(quest) + '\n'

small_clean_questions = re.sub(r'[\[\]]','',small_clean_questions)

with open(output_file,'w') as f:
    f.write(small_clean_questions)

### Answer generation part
Now we generate the answers by running a summarization model on each page divided by 5; using the same parts as the question so, the summary of each part will correspond to the answer of the question. 

Actually, the summaries weren't so good so I answered manually by matching up each question with the passage it came from in the report.

Building a dictionary structure to hold the questions amswers and difficulty

In [3]:
questions_file = "small_dataset_questions.txt"
answers_file = "small_dataset_answers.txt"

test_data = {}

with open(questions_file, "r") as qf, open(answers_file, "r") as af:
    questions = qf.readlines()
    answers = af.readlines()

for idx, (question, answer_line) in enumerate(zip(questions, answers), start=1):
    question = question.strip()
    answer_line = answer_line.strip()
    question = question[1:-1]

    # Split the answer line into difficulty and answers
    difficulty, answer_text = answer_line.split(",",1)
    difficulty = int(difficulty)
    answer_list = [answer_text[1:-1]]

    # Store in dictionary
    test_data[idx] = {
        "question": question,
        "difficulty": difficulty,
        "answers": answer_list
    }

# Output the dictionary
print(test_data)

{1: {'question': 'What does WASHNORM stand for?', 'difficulty': 2, 'answers': ['WASHNORM stands for Water Sanitation and Hygiene National Outcome Routine Mapping']}, 2: {'question': 'What is the annual WASHNORM survey?', 'difficulty': 1, 'answers': ['The annual WASHNORM survey is an annual national household and facility-based survey encompassing a comprehensive range of key outcome indicators and parameters related to the WASH sector.']}, 3: {'question': 'What changes were made to the scope and methodology of WASHNORM 2019 in WASHNORM 2021?', 'difficulty': 3, 'answers': ["'A few questions on COVID 19 were added to the household module and the public utility mapping was expanded to include the energy requirement of urban waterworks."]}, 4: {'question': 'What are some key objectives of WASHNORM?', 'difficulty': 2, 'answers': ["'To provide up-to-date and detailed data, gain insight from different stakeholders, monitor and track WASH outcomes, strengthen WASH institutions."]}, 5: {'questi

In [4]:
# get in second set of answers
new_answers_file = "second_answers.txt"

with open(new_answers_file, "r") as naf:
    new_answers = naf.readlines()

for idx, answer_line in enumerate(new_answers, start=1):
    
    additional_answers = [ans.strip() for ans in answer_line.strip().split("|")]
    
    # Append new answers to the existing list
    test_data[idx]["answers"].extend(additional_answers)

# Updated dictionary now contains both sets of answers
print(test_data)

{1: {'question': 'What does WASHNORM stand for?', 'difficulty': 2, 'answers': ['WASHNORM stands for Water Sanitation and Hygiene National Outcome Routine Mapping', 'WASHNORM stands for Water, Sanitation, and Hygiene National Outcome Routine Mapping.', 'WASHNORM is an acronym for Water, Sanitation, and Hygiene National Outcome Routine Mapping.', 'The full form of WASHNORM is Water Sanitation and Hygiene National Outcome Routine Mapping.']}, 2: {'question': 'What is the annual WASHNORM survey?', 'difficulty': 1, 'answers': ['The annual WASHNORM survey is an annual national household and facility-based survey encompassing a comprehensive range of key outcome indicators and parameters related to the WASH sector.', 'The annual WASHNORM survey collects data on national households and facilities, covering key WASH sector indicators.', 'WASHNORM is an annual survey that gathers household and facility data to track WASH sector outcomes.', 'Each year, the WASHNORM survey collects a variety of ou

In [7]:
import pickle
# Save documents as a pickle file
output_path = "test_data.pkl"
with open(output_path, "wb") as f:
    pickle.dump(test_data, f)