## Rule-based question generation

The commands here run [Heilman & Smith's question generation system](http://www.cs.cmu.edu/~ark/mheilman/questions/), which applies rule-based transformations to syntactic parse trees to derive questions from sentences, where each question has an answer contained in the transformed sentence. The system also scores the generated questions with a statistical model of quality. The result of the code here is a dataset of input sentence and question pairs (with answers highlighted in the input sentence) along with their scores.

I also perform named entity tagging on the sentence-question pairs in order to mask tokens with their corresponding entity tags, since entity masking is a component of our question generation approach. The entity-tagged sentences and questions are stored as additional columns in the dataset.

In [1]:
import pandas
import pickle
import os

pandas.set_option('display.max_colwidth', None)  # None = no truncation; -1 is deprecated since pandas 1.0
pandas.set_option('display.max_rows', 500)

### Question Generation on SQuAD dataset

Because the Java code for the rule-based system takes a long time to execute, I created a script, rule_based_question_generation.py, so that multiple analyses can be run in parallel. This script launches the Java code as a subprocess. Before running the script, you should start the supersense tagging and parsing servers using the README instructions in /home/mroemmele/question_generation/QuestionGeneration. 

To run the parsing server:
`java -Xmx1200m -cp question-generation.jar edu.cmu.ark.StanfordParserServer --grammar config/englishFactored.ser.gz --port 5556 --maxLength 40`

To run the supersense tagging server:
`java -Xmx500m -cp lib/supersense-tagger.jar edu.cmu.ark.SuperSenseTaggerServer  --port 5557 --model config/superSenseModelAllSemcor.ser.gz --properties config/QuestionTransducer.properties.1`

Start multiple servers by varying the port number. These port numbers need to be consistent with the ones specified in config/QuestionTransducer.properties.{X}. There is a properties file for each unique pair of supersenseServerPort and parserServerPort specifications.

Each properties file can be provided as input (via the -config_file parameter) to a different run of rule_based_question_generation.py, so that each run uses the supersense tagging and parsing servers specified in that config. The script takes a file (-input_file) of newline-separated texts from which questions will be generated. Specify which chunk of texts each run should process using the -start_idx and -end_idx parameters, which correspond to line indices in the input file. The script processes only the texts within these indices and then writes the questions generated from them to a file {save_prefix}_{start_idx}_{end_idx}.csv. Here is an example:

`python rule_based_question_generation.py -input_file  ~/question_generation/newsqa_untok_data/train/unique_paragraphs.txt -config_file ~/question_generation/HS-QuestionGeneration/config/QuestionTransducer.properties.1 -start_idx 0 -end_idx 10 -save_prefix ~/question_generation/squad_data/hs_output_sample`
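
The chunked runs described above can be sketched as follows. This is only an illustrative sketch, not part of the original script: `chunk_ranges` and `build_commands` are hypothetical helpers, the paths and sizes are placeholders, and it assumes that -end_idx is exclusive, which may not match how the script actually treats it.

```python
from itertools import cycle


def chunk_ranges(n_texts, chunk_size):
    """Yield (start_idx, end_idx) pairs covering range(n_texts).

    Assumption: end_idx is exclusive.
    """
    for start in range(0, n_texts, chunk_size):
        yield start, min(start + chunk_size, n_texts)


def build_commands(input_file, config_files, save_prefix, n_texts, chunk_size):
    """Build one argument list per chunk, cycling through the properties files."""
    commands = []
    for (start, end), config in zip(chunk_ranges(n_texts, chunk_size),
                                    cycle(config_files)):
        commands.append(["python", "rule_based_question_generation.py",
                         "-input_file", input_file,
                         "-config_file", config,
                         "-start_idx", str(start),
                         "-end_idx", str(end),
                         "-save_prefix", save_prefix])
    return commands


# Placeholder file names and sizes, for illustration only
example_commands = build_commands(
    "unique_paragraphs.txt",
    ["config/QuestionTransducer.properties.1",
     "config/QuestionTransducer.properties.2"],
    "hs_output", n_texts=10, chunk_size=4)
for command in example_commands:
    print(" ".join(command))
```

Each argument list could then be launched in its own process. Runs that share a properties file also share that file's servers, so they would compete for the same parser and tagger.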

In [3]:
import data_utils
import importlib
importlib.reload(data_utils)

W1011 00:30:37.414754 140482265536320 deprecation_wrapper.py:119] From /.auto/home6/mroemmele/CoreNLP/sandbox/entity_QA/data_utils.py:6: The name tf.logging.ERROR is deprecated. Please use tf.compat.v1.logging.ERROR instead.



<module 'data_utils' from '/.auto/home6/mroemmele/CoreNLP/sandbox/entity_QA/data_utils.py'>

## Concatenate multiple rule-based QG output dataframes into one

In [2]:
data = pandas.concat([pandas.read_csv("/home/mroemmele/question_generation/newsqa_rule_generated_data/train/full_dataframe_0_to_2500.csv"),
pandas.read_csv("/home/mroemmele/question_generation/newsqa_rule_generated_data/train/full_dataframe_2500_to_5000.csv"),
pandas.read_csv("/home/mroemmele/question_generation/newsqa_rule_generated_data/train/full_dataframe_5000_to_7500.csv"),
pandas.read_csv("/home/mroemmele/question_generation/newsqa_rule_generated_data/train/full_dataframe_7500_to_10000.csv"),
pandas.read_csv("/home/mroemmele/question_generation/newsqa_rule_generated_data/train/full_dataframe_10000_to_12500.csv")])
data

Unnamed: 0,text_id,question,answer_sent,answer,score
0,0,What dubbed ``the house of horrors?'',"NEW DELHI, India-- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed ``the house of horrors.''",NEW DELHI-- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case,3.620755
1,0,Who told CNN the Allahabad high court has acquitted Moninder Singh Pandher?,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.",Sikandar B. Kochar,3.251408
2,0,What was Moninder Singh Pandher sentenced to death by?,Moninder Singh Pandher was sentenced to death by a lower court in February.,by a lower court in February,3.146476
3,0,Who did Sikandar B. Kochar tell CNN the Allahabad high court has acquitted?,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.",Moninder Singh Pandher,3.123501
4,0,What did Sikandar B. Kochar tell CNN has acquitted Moninder Singh Pandher?,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.",the Allahabad high court,3.074390
...,...,...,...,...,...
167424,11468,What do we 've gotten?,We 've gotten blatant lies and excuses.,blatant lies and excuses,0.700773
167425,11468,When did Cooey say?,"A federal appeals court ruled Thursday that Cooey waited too long to raise the medical issues, saying he ``knew of and could have filed suit over vein access prior to July 2005.''",he ``knew of and could have filed suit over vein access prior to July 2005'',0.625771
167426,11468,What is that inmate still on?,That inmate is still on death row.,on death row,0.585342
167427,11468,What does he say to?,"He says his veins are weakened because of his health issues, and the lethal drugs would amount to cruel and unusual punishment.",the lethal drugs would amount to cruel and unusual punishment,0.259447


In [3]:
'''Save dataset'''

data.to_csv("/home/mroemmele/question_generation/newsqa_rule_generated_data/train/full_dataframe_all.csv")
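
Saving with the default settings writes the row index as an extra column, which appears as an `Unnamed: 0` column when the file is reloaded. As an optional variation (not what the cells above do), the sketch below renumbers rows during concatenation and leaves the index out of the saved file. The two in-memory CSVs are made-up stand-ins for the per-chunk files.

```python
import io

import pandas

# Two tiny stand-ins for the per-chunk CSV files (contents are made up)
chunk_a = io.StringIO("text_id,question,score\n0,Who told CNN?,3.2\n")
chunk_b = io.StringIO("text_id,question,score\n1,What is that inmate still on?,0.5\n")

# ignore_index=True renumbers rows 0..n-1 across all chunks
combined = pandas.concat([pandas.read_csv(chunk_a), pandas.read_csv(chunk_b)],
                         ignore_index=True)

# index=False keeps the row index out of the file,
# so reloading does not add an extra unnamed column
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pandas.read_csv(buffer)
print(list(reloaded.columns))
```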

#### Insert answer annotations into input texts

In [5]:
'''Load the output from the rule-based question generation system'''

partition = 'train'
qg_data = pandas.read_csv('/home/mroemmele/question_generation/newsqa_rule_generated_data/train/full_dataframe_all.csv')
qg_data[:25]

Unnamed: 0.1,Unnamed: 0,text_id,question,answer_sent,answer,score
0,0,0,What dubbed ``the house of horrors?'',"NEW DELHI, India-- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed ``the house of horrors.''",NEW DELHI-- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case,3.620755
1,1,0,Who told CNN the Allahabad high court has acquitted Moninder Singh Pandher?,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.",Sikandar B. Kochar,3.251408
2,2,0,What was Moninder Singh Pandher sentenced to death by?,Moninder Singh Pandher was sentenced to death by a lower court in February.,by a lower court in February,3.146476
3,3,0,Who did Sikandar B. Kochar tell CNN the Allahabad high court has acquitted?,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.",Moninder Singh Pandher,3.123501
4,4,0,What did Sikandar B. Kochar tell CNN has acquitted Moninder Singh Pandher?,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.",the Allahabad high court,3.07439
5,5,0,What did Sikandar B. Kochar tell the Allahabad high court has acquitted Moninder Singh Pandher?,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.",CNN,3.068218
6,6,0,What did Kochar say Pandher was summoned as co-accused during?,"Pandher was not named a main suspect by investigators initially, but was summoned as co-accused during the trial, Kochar said.",during the trial,3.014826
7,7,0,Who was sentenced to death by a lower court in February?,Moninder Singh Pandher was sentenced to death by a lower court in February.,Moninder Singh Pandher,2.946347
8,8,0,Who said Pandher was summoned as co-accused during the trial?,"Pandher was not named a main suspect by investigators initially, but was summoned as co-accused during the trial, Kochar said.",Kochar,2.941241
9,9,0,What was Moninder Singh Pandher sentenced to by a lower court in February?,Moninder Singh Pandher was sentenced to death by a lower court in February.,to death,2.930043


In [5]:
len(qg_data)#, len(answer_start_chars), len(answer_end_chars)

264

In [6]:
'''Annotate answers in input texts'''


def get_answer_annotated_data(data):
    filtered_questions = []
    filtered_answer_sents = []
    filtered_answers = []

    for question, answer_sent, answer in data[['question', 'answer_sent', 'answer']].values:
        answer_start_char = None
        while answer_start_char is None:
            try:
                # Case-insensitive search for the answer span in the sentence
                answer_start_char = answer_sent.lower().index(answer.lower())
            except (ValueError, AttributeError):  # Span not found, or non-string (e.g. NaN) value
                if " " not in answer:  # Answer not found
                    break
                # Trim leading word from answer to see if subsegment can be found
                answer = answer[answer.index(" ") + 1:]
        if answer_start_char is not None:
            answer_end_char = answer_start_char + len(answer)
            # Wrap the matched span in <ANSWER> ... </ANSWER> tags
            answer_sent = (answer_sent[:answer_start_char] + "<ANSWER> "
                           + answer_sent[answer_start_char:answer_end_char]
                           + " </ANSWER>" + answer_sent[answer_end_char:])
            filtered_answer_sents.append(answer_sent)
            filtered_questions.append(question)
            filtered_answers.append(answer)
    return {'answer_sent': filtered_answer_sents,
            'question': filtered_questions,
            'answer': filtered_answers}
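
To make the trimming fallback concrete, here is a standalone sketch of the same idea applied to a single sentence. `annotate_answer` is a hypothetical helper written for illustration, not a function from this notebook, and the example sentence is a shortened version of the first row of the data.

```python
def annotate_answer(answer_sent, answer):
    """Return (tagged sentence, matched answer), or None if no match is found."""
    lowered = answer_sent.lower()
    while True:
        start = lowered.find(answer.lower())
        if start != -1:
            break
        if " " not in answer:
            return None  # No sub-span of the answer occurs in the sentence
        # Drop the leading word and retry with the shorter span
        answer = answer.split(" ", 1)[1]
    end = start + len(answer)
    tagged = (answer_sent[:start] + "<ANSWER> " + answer_sent[start:end]
              + " </ANSWER>" + answer_sent[end:])
    return tagged, answer


sent = "NEW DELHI, India-- A high court acquitted a wealthy businessman."
# The full answer is not a substring of the sentence (", India" intervenes),
# so leading words are trimmed until "A high court" matches.
print(annotate_answer(sent, "NEW DELHI-- A high court"))
```

The returned answer is the trimmed span, so the saved answer can be shorter than the one the rule-based system produced.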


In [7]:
annotated_qg_data = get_answer_annotated_data(qg_data)
pandas.DataFrame(annotated_qg_data)[:100]

Unnamed: 0,answer_sent,question,answer
0,"NEW DELHI, India-- <ANSWER> A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case </ANSWER> dubbed ``the house of horrors.''",What dubbed ``the house of horrors?'',A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case
1,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer <ANSWER> Sikandar B. Kochar </ANSWER> told CNN.",Who told CNN the Allahabad high court has acquitted Moninder Singh Pandher?,Sikandar B. Kochar
2,Moninder Singh Pandher was sentenced to death <ANSWER> by a lower court in February </ANSWER>.,What was Moninder Singh Pandher sentenced to death by?,by a lower court in February
3,"The Allahabad high court has acquitted <ANSWER> Moninder Singh Pandher </ANSWER>, his lawyer Sikandar B. Kochar told CNN.",Who did Sikandar B. Kochar tell CNN the Allahabad high court has acquitted?,Moninder Singh Pandher
4,"<ANSWER> The Allahabad high court </ANSWER> has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.",What did Sikandar B. Kochar tell CNN has acquitted Moninder Singh Pandher?,the Allahabad high court
5,"The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told <ANSWER> CNN </ANSWER>.",What did Sikandar B. Kochar tell the Allahabad high court has acquitted Moninder Singh Pandher?,CNN
6,"Pandher was not named a main suspect by investigators initially, but was summoned as co-accused <ANSWER> during the trial </ANSWER>, Kochar said.",What did Kochar say Pandher was summoned as co-accused during?,during the trial
7,<ANSWER> Moninder Singh Pandher </ANSWER> was sentenced to death by a lower court in February.,Who was sentenced to death by a lower court in February?,Moninder Singh Pandher
8,"Pandher was not named a main suspect by investigators initially, but was summoned as co-accused during the trial, <ANSWER> Kochar </ANSWER> said.",Who said Pandher was summoned as co-accused during the trial?,Kochar
9,Moninder Singh Pandher was sentenced <ANSWER> to death </ANSWER> by a lower court in February.,What was Moninder Singh Pandher sentenced to by a lower court in February?,to death


In [9]:
len(pandas.DataFrame(annotated_qg_data))

1223214

In [11]:
'''Save the dataset'''

data_dir = "/home/mroemmele/question_generation/newsqa_rule_generated_data/untok_data/"

# Create the partition directory (and any missing parent directories)
os.makedirs(os.path.join(data_dir, partition), exist_ok=True)
    
with open(os.path.join(data_dir, partition, 'answer_sents.txt'), 'w') as f:
    f.write("\n".join(annotated_qg_data['answer_sent']))

with open(os.path.join(data_dir, partition, 'questions.txt'), 'w') as f:
    f.write("\n".join(annotated_qg_data['question']))
    
with open(os.path.join(data_dir, partition, 'answers_only.txt'), 'w') as f:
    f.write("\n".join(annotated_qg_data['answer']))

# Code below was moved to data_prep.ipynb

## Create a validation set from existing training dataset

Use this set for validation when training the question generation model, so that the provided validation data can instead be used as a held-out test set.

In [None]:
data_dir = '''/home/mroemmele/question_generation/newsqa_untok_data'''
percent_heldout = 0.02


In [None]:
'''Load training data'''

paragraphs = [paragraph.strip()
              for paragraph in open(os.path.join(data_dir, 'train', 'paragraphs.txt'))]
answer_sents = [sent.strip()
                for sent in open(os.path.join(data_dir, 'train', 'answer_sents.txt'))]
questions = [question.strip()
             for question in open(os.path.join(data_dir, 'train', 'questions.txt'))]
answers = [answer.strip()
           for answer in open(os.path.join(data_dir, 'train', 'answers_only.txt'))]

In [None]:
'''Randomly select indices of heldout data items or load them from file'''

import random

rand_idxs = list(range(len(answer_sents)))
random.shuffle(rand_idxs)
n_heldout = int(percent_heldout * len(answer_sents))
heldout_idxs = set(rand_idxs[:n_heldout])

# with open("/home/mroemmele/question_generation/squad_rule_mimic_heldout_train_idxs_2_percent.pkl", 'rb') as f:
#     heldout_idxs = pickle.load(f)

list(heldout_idxs)[:10]

In [None]:
'''Save the random indices so they can be used for other datasets'''

with open(os.path.join(data_dir, "newsqa_heldout_train_idxs_2_percent.pkl"), 'wb') as f:
    pickle.dump(heldout_idxs, f)
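
Pickling the indices makes the split reusable. Seeding the random number generator is a complementary way to make it reproducible. The sketch below is an assumption-laden illustration rather than the notebook's own code: `sample_heldout_idxs` and `split_by_idxs` are hypothetical helpers, and the seed value is arbitrary.

```python
import random


def sample_heldout_idxs(n_items, percent_heldout, seed):
    """Pick a reproducible random subset of indices to hold out."""
    rng = random.Random(seed)  # Local generator, so global random state is untouched
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    return set(idxs[:int(percent_heldout * n_items)])


def split_by_idxs(items, heldout_idxs):
    """Split items into (train, valid) according to the held-out indices."""
    train, valid = [], []
    for idx, item in enumerate(items):
        (valid if idx in heldout_idxs else train).append(item)
    return train, valid


# Toy data: hold out 25% of 8 items
toy_questions = ["q{}".format(i) for i in range(8)]
toy_heldout = sample_heldout_idxs(len(toy_questions), 0.25, seed=0)
toy_train, toy_valid = split_by_idxs(toy_questions, toy_heldout)
print(len(toy_train), len(toy_valid))
```

Applying `split_by_idxs` with the same index set to each parallel list (paragraphs, answer sentences, questions, answers) keeps the items aligned across files.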

In [None]:
'''Partition out validation data from training data by randomly selecting items'''

train_paragraphs, train_answer_sents, train_questions, train_answers = [], [], [], []
valid_paragraphs, valid_answer_sents, valid_questions, valid_answers = [], [], [], []

for idx, (paragraph, answer_sent,
          question, answer) in enumerate(zip(paragraphs, answer_sents, questions, answers)):
    #import pdb;pdb.set_trace()
    if idx in heldout_idxs:
        valid_paragraphs.append(paragraph)
        valid_answer_sents.append(answer_sent)
        valid_questions.append(question)
        valid_answers.append(answer)
    else:
        train_paragraphs.append(paragraph)
        train_answer_sents.append(answer_sent)
        train_questions.append(question)
        train_answers.append(answer)

assert len(train_paragraphs) == len(train_answer_sents) == len(train_questions) == len(train_answers)
assert len(valid_paragraphs) == len(valid_answer_sents) == len(valid_questions) == len(valid_answers)

## Code below is just scratchpad

#### Question generation on a single text

In [47]:
with open("/home/mroemmele/misc_test_texts/nmt_blog_concat.txt") as f:
    text = f.read()

In [50]:
'''Switch to directory that contains H&S code'''

os.chdir("/home/mroemmele/question_generation/HS-QuestionGeneration/")

In [53]:
import subprocess

process = subprocess.Popen(["java", "-Xmx1200m",
                            "-cp", "question-generation.jar",
                            "edu/cmu/ark/QuestionAsker",
                            "--verbose", "--model", "models/linear-regression-ranker-reg500.ser.gz",
                            "--just-wh", "--max-length", "30", "--downweight-pro"],
                           stdin=subprocess.PIPE, stdout=subprocess.PIPE)
qg_output = [output.split("\t") for output in
             process.communicate(text.encode())[0].decode('utf-8').strip().split("\n")]

In [330]:
'''View output as data frame'''

qg_output = pandas.DataFrame(qg_output)#[[1, 0]]

Unnamed: 0,0,1,2,3
0,What does SDL ETS use Edge-Cloud capabilities to seamlessly integrate with in addition?,"In addition, it now uses Edge-Cloud capabilities to seamlessly integrate with on-premise solutions to provide secure ground to Cloud deployment options for brands with Linguistic AI capabilities, powered by Hai.",with on-premise solutions to provide secure ground to Cloud deployment options for brands with Linguistic AI capabilities,3.5378067735475214
1,Who is the new SDL NMT 2. 0 Russian engine being made available to?,"The new SDL NMT 2. 0 Russian engine is being made available to enterprise customers via SDL Enterprise Translation Server, a secure NMT product, enabling organizations to translate large volumes of information into multiple languages.",to enterprise customers via SDL Enterprise Translation Server enabling organizations to translate large volumes of information into multiple languages,3.3519041332409145
2,What is being made available to enterprise customers via SDL Enterprise Translation Server enabling organizations to translate large volumes of information into multiple languages?,"The new SDL NMT 2. 0 Russian engine is being made available to enterprise customers via SDL Enterprise Translation Server, a secure NMT product, enabling organizations to translate large volumes of information into multiple languages.",the new SDL NMT 2. 0 Russian engine,3.335270846081593
3,What uses Edge-Cloud capabilities to seamlessly integrate with on-premise solutions to provide secure ground to Cloud deployment options for brands with Linguistic AI capabilities in addition?,"In addition, it now uses Edge-Cloud capabilities to seamlessly integrate with on-premise solutions to provide secure ground to Cloud deployment options for brands with Linguistic AI capabilities, powered by Hai.",SDL ETS,3.2294493511791784
4,"Who said, ``SDL is revolutionizing the way brands understand and engage with worldwide audiences with its Neural Machine Translation technology''?","``SDL is revolutionizing the way brands understand and engage with worldwide audiences with its Neural Machine Translation technology,'' said Mihai Vlad, VP of Machine Learning and AI, SDL.",Mihai Vlad,3.1703440059595995
5,Who was VP of Machine Learning and AI?,"``SDL is revolutionizing the way brands understand and engage with worldwide audiences with its Neural Machine Translation technology,'' said Mihai Vlad, VP of Machine Learning and AI, SDL.",Mihai Vlad,2.9512320390682376
6,What was Mihai Vlad VP of?,"``SDL is revolutionizing the way brands understand and engage with worldwide audiences with its Neural Machine Translation technology,'' said Mihai Vlad, VP of Machine Learning and AI, SDL.",of Machine Learning and AI,2.8756886654113187
7,Who has selected SDL Enterprise Translation Server to translate its online store?,"com, a leading b2b e-commerce platform connecting buyers with Chinese suppliers, has selected SDL Enterprise Translation Server to translate its online store, which offers over 30 million products to a global audience.",com,2.8697999572430906
8,What is Ltd.?,It is a key software company in China's national planning layout and one of the first batch of national IT pilot companies.,a key software company in China's national planning layout and one of the first batch of national IT pilot companies,2.833576952645507
9,"Who is a highly inflected language with different syntax, grammar, and word order compared to English?","Russian is a highly inflected language with different syntax, grammar, and word order compared to English.",Russian,2.829255828560276


In [None]:
'''Save output'''

qg_output.to_csv("outputs/nmt_blog_concat.csv", index=False)
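
The data frame above has integer column labels. A possible refinement is to name the columns and convert the score to a number, as in the sketch below. The column names are an assumption inferred from the displayed output (question, source sentence, answer, score), `parse_qg_output` is a hypothetical helper, and the sample text is made up.

```python
import pandas

# Assumed field order, inferred from the displayed output
COLUMNS = ["question", "answer_sent", "answer", "score"]


def parse_qg_output(raw_output):
    """Parse tab-separated question generation output into a labeled data frame."""
    rows = [line.split("\t") for line in raw_output.strip().split("\n")]
    # Keep only lines with the expected number of fields
    rows = [row for row in rows if len(row) == len(COLUMNS)]
    frame = pandas.DataFrame(rows, columns=COLUMNS)
    frame["score"] = frame["score"].astype(float)
    # Highest-scoring questions first
    return frame.sort_values("score", ascending=False).reset_index(drop=True)


sample_output = ("Who said it?\tKochar said it.\tKochar\t2.9\n"
                 "What was said?\tKochar said it.\tit\t3.1\n")
print(parse_qg_output(sample_output))
```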

In [329]:
'''Print output'''

for sent, question in pandas.DataFrame(qg_output)[:71][[1, 0]].values:
    print("INPUT:", sent)
    print("RULE QUESTION:", question)
    print()

INPUT: In addition, it now uses Edge-Cloud capabilities to seamlessly integrate with on-premise solutions to provide secure ground to Cloud deployment options for brands with Linguistic AI capabilities, powered by Hai.
RULE QUESTION: What does SDL ETS use Edge-Cloud capabilities to seamlessly integrate with in addition?

INPUT: The new SDL NMT 2. 0 Russian engine is being made available to enterprise customers via SDL Enterprise Translation Server, a secure NMT product, enabling organizations to translate large volumes of information into multiple languages.
RULE QUESTION: Who is the new SDL NMT 2. 0 Russian engine being made available to?

INPUT: The new SDL NMT 2. 0 Russian engine is being made available to enterprise customers via SDL Enterprise Translation Server, a secure NMT product, enabling organizations to translate large volumes of information into multiple languages.
RULE QUESTION: What is being made available to enterprise customers via SDL Enterprise Translation Server ena