# Guidelines



# Generating Multiple-Choice Questions from Text

Goal:

The goal of this notebook is to demonstrate the process of generating multiple-choice questions (MCQs) from a given text input. This includes preprocessing the text, utilizing language models for question generation, and presenting the generated questions along with their options.

Output:

The notebook generates MCQs based on the provided input text, offering multiple-choice options for each question. These questions are designed to assess understanding and comprehension of the text.

Models Used:

T5 Model (Text-To-Text Transfer Transformer):

Used for conditional text generation.

Specifically fine-tuned for generating questions from context.

Sense2Vec Model (an advanced version of Word2Vec):

Utilized for retrieving sense-aware word vectors, aiding in the generation of meaningful options for MCQs.

Preprocessing Steps:

Text Tokenization:

The input text is tokenized into sentences and words to facilitate further processing.

Stopword Removal:

Common stopwords are removed to filter out irrelevant words from consideration.

Part-of-Speech Tagging:

Part-of-speech tagging is performed to identify nouns and proper nouns, which are essential for generating meaningful questions and options.

Candidate Selection:

The MultipartiteRank algorithm is employed to select candidate keywords and phrases from the text.
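The first two preprocessing steps can be illustrated with a minimal, standard-library sketch. It is a simplified stand-in, not the notebook's implementation: the regex sentence splitter and the tiny `STOPWORDS` set are assumptions made for illustration, whereas the notebook relies on NLTK's `sent_tokenize` and its English stopword list. The helper names `split_sentences` and `content_words` are hypothetical.

```python
import re

# A deliberately tiny stopword set for illustration (assumption); the
# notebook itself relies on NLTK's English stopword list.
STOPWORDS = {"a", "an", "and", "are", "is", "it", "of", "that", "the", "to", "you", "your"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(sentence):
    """Lowercase the sentence, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]


text = "The respiratory system is a network of organs. It includes your airways and lungs."
for sentence in split_sentences(text):
    print(sentence, "->", content_words(sentence))
```

Part-of-speech tagging and MultipartiteRank are left out here because they need trained models; the notebook delegates them to spaCy and pke.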

Example Usage:

The notebook includes a PythonPredictor class with methods for predicting MCQs from a given input text. Users can instantiate this class, provide their input text, and obtain the generated MCQs with options as output.


Note:

The provided code and methodology are intended for educational and demonstration purposes only. Users are encouraged to further customize and adapt the codebase to their specific requirements and use cases.

# Required Installation

In [1]:
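!pip install sense2vec
!pip install strsim
# sentencepiece is required by T5Tokenizer; flashtext provides KeywordProcessor
!pip install sentencepiece
!pip install flashtext
!pip install git+https://github.com/boudinfl/pke.git
!python -m nltk.downloader stopwords

!python -m nltk.downloader universal_tagset
# download the English model that spacy.load('en_core_web_sm') expects
!python -m spacy download en_core_web_sm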

Collecting sense2vec
  Using cached sense2vec-2.0.2-py2.py3-none-any.whl.metadata (54 kB)
Downloading sense2vec-2.0.2-py2.py3-none-any.whl (40 kB)
Installing collected packages: sense2vec
Successfully installed sense2vec-2.0.2

[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
Collecting strsim
  Using cached strsim-0.0.3-py3-none-any.whl.metadata (19 kB)
Using cached strsim-0.0.3-py3-none-any.whl (42 kB)
Installing collected packages: strsim
Successfully installed strsim-0.0.3

Collecting git+https://github.com/boudinfl/pke.git
  Cloning https

# Load Modules and Packages

In [5]:
import time
import torch
import numpy
from transformers import T5ForConditionalGeneration,T5Tokenizer
import random
import spacy
import boto3
import zipfile
import os
import json
from sense2vec import Sense2Vec
import requests
from collections import OrderedDict
import string

import pke
import nltk
from nltk import FreqDist
nltk.download('brown')
nltk.download('stopwords')
nltk.download('popular')
from nltk.corpus import stopwords
from nltk.corpus import brown
from similarity.normalized_levenshtein import NormalizedLevenshtein

from nltk.tokenize import sent_tokenize


[nltk_data] Downloading package brown to
[nltk_data]     /var/home/ramrshrcg/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /var/home/ramrshrcg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /var/home/ramrshrcg/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /var/home/ramrshrcg/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /var/home/ramrshrcg/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /var/home/ramrshrcg/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-dat

In [6]:
from flashtext import KeywordProcessor


In [7]:
!ls

'Fine Tuning Quesition and MCQs Generator Using Sense2Vec T5 Model.ipynb'
'MCQ Question Generator My Model'
'MCQ Question Generator My Model.ipynb'
'Question MCQs Generation Simple One T5 Model.ipynb'
'Question MCQs Generator Tutorials.ipynb'
 SmartStoplist.txt


In [8]:
def MCQs_available(word,s2v):
    word = word.replace(" ", "_")
    sense = s2v.get_best_sense(word)
    # A sense of None means sense2vec has no entry for this word
    return sense is not None

In [9]:
def edits(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz '+string.punctuation
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
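A short usage sketch may make the purpose of `edits` clearer. The function body below repeats the helper above so the block runs on its own; the sample word and the checks are illustrative assumptions. The resulting set is what `sense2vec_get_words` uses to discard distractors that are near-misspellings of the answer.

```python
import string


def edits(word):
    """Same one-edit neighbourhood as the notebook's edits() helper."""
    letters = "abcdefghijklmnopqrstuvwxyz " + string.punctuation
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


neighbours = edits("lung")
print("lungs" in neighbours)   # reachable by inserting "s" at the end
print("lugn" in neighbours)    # reachable by transposing the last two letters
print("heart" in neighbours)   # needs more than one edit, so it is not in the set
```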

In [10]:
def sense2vec_get_words(word,s2v):
    output = []

    word_preprocessed =  word.translate(word.maketrans("","", string.punctuation))
    word_preprocessed = word_preprocessed.lower()

    word_edits = edits(word_preprocessed)

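    word = word.replace(" ", "_")

    sense = s2v.get_best_sense(word)
    # get_best_sense returns None when the word is not in the sense2vec vocabulary
    if sense is None:
        return output
    most_similar = s2v.most_similar(sense, n=15)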

    compare_list = [word_preprocessed]
    for each_word in most_similar:
        append_word = each_word[0].split("|")[0].replace("_", " ")
        append_word = append_word.strip()
        append_word_processed = append_word.lower()
        append_word_processed = append_word_processed.translate(append_word_processed.maketrans("","", string.punctuation))
        if append_word_processed not in compare_list and word_preprocessed not in append_word_processed and append_word_processed not in word_edits:
            output.append(append_word.title())
            compare_list.append(append_word_processed)


    out = list(OrderedDict.fromkeys(output))

    return out
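The key handling inside the loop above can be sketched in isolation. sense2vec keys have the form `phrase|SENSE` with underscores in place of spaces; the helper name `clean_s2v_key` and the sample keys are hypothetical and used only for illustration, and no sense2vec model is loaded.

```python
import string


def clean_s2v_key(key):
    """Turn a key such as 'blood_vessels|NOUN' into display and comparison forms."""
    phrase = key.split("|")[0].replace("_", " ").strip()
    # Comparison form: lowercase with punctuation stripped, as in the notebook.
    normalized = phrase.lower().translate(str.maketrans("", "", string.punctuation))
    return phrase.title(), normalized


# Hypothetical keys written in the "phrase|SENSE" format.
for key in ["blood_vessels|NOUN", "Right_lung|NOUN", "C++|NOUN"]:
    print(key, "->", clean_s2v_key(key))
```

The title-cased form is what ends up in the option list, while the normalized form is only used to detect duplicates and near-copies of the answer.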

In [11]:
# Return a list of distractors and the name of the algorithm that produced them
def get_options(answer,s2v):
    distractors =[]

    try:
        distractors = sense2vec_get_words(answer,s2v)
        if len(distractors) > 0:
            print(" Sense2vec_distractors successful for word : ", answer)
            return distractors,"sense2vec"
    except Exception:
        print (" Sense2vec_distractors failed for word : ",answer)


    return distractors,"None"

In [12]:
def tokenize_sentences(text):
    sentences = sent_tokenize(text)
    # Drop sentences of 20 characters or fewer.
    sentences = [sentence.strip() for sentence in sentences if len(sentence) > 20]
    return sentences

In [13]:
def get_sentences_for_keyword(keywords, sentences):
    keyword_processor = KeywordProcessor()
    keyword_sentences = {}
    for word in keywords:
        word = word.strip()
        keyword_sentences[word] = []
        keyword_processor.add_keyword(word)
    for sentence in sentences:
        keywords_found = keyword_processor.extract_keywords(sentence)
        for key in keywords_found:
            keyword_sentences[key].append(sentence)

    for key in keyword_sentences.keys():
        values = keyword_sentences[key]
        values = sorted(values, key=len, reverse=True)
        keyword_sentences[key] = values

    delete_keys = []
    for k in keyword_sentences.keys():
        if len(keyword_sentences[k]) == 0:
            delete_keys.append(k)
    for del_key in delete_keys:
        del keyword_sentences[del_key]

    return keyword_sentences
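The mapping built above can be approximated without flashtext. The sketch below is an assumption-laden stand-in that uses a case-insensitive whole-phrase regex instead of `KeywordProcessor`, so it may differ from flashtext in edge cases such as overlapping keywords. The function name `sentences_for_keywords` is hypothetical.

```python
import re


def sentences_for_keywords(keywords, sentences):
    """Map each keyword to the sentences that mention it, longest sentence first."""
    mapping = {}
    for keyword in keywords:
        keyword = keyword.strip()
        # Whole-phrase, case-insensitive match; a stand-in for flashtext.
        pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE)
        hits = [s for s in sentences if pattern.search(s)]
        if hits:  # keywords with no matching sentence are dropped
            mapping[keyword] = sorted(hits, key=len, reverse=True)
    return mapping


sentences = [
    "It includes your airways, lungs, and blood vessels.",
    "The muscles that power your lungs are also part of the respiratory system.",
]
mapping = sentences_for_keywords(["lungs", "aorta"], sentences)
print(mapping)
```

Sorting by length puts the most detailed sentences first, which matters later because only the first three sentences per keyword are passed to the model as context.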

In [14]:
def is_far(words_list,currentword,thresh,normalized_levenshtein):
    threshold = thresh
    score_list =[]
    for word in words_list:
        score_list.append(normalized_levenshtein.distance(word.lower(),currentword.lower()))
    if min(score_list)>=threshold:
        return True
    else:
        return False

In [15]:
# max_phrases avoids shadowing the built-in max(); all callers pass it positionally
def filter_phrases(phrase_keys,max_phrases,normalized_levenshtein ):
    filtered_phrases =[]
    if len(phrase_keys)>0:
        filtered_phrases.append(phrase_keys[0])
        for ph in phrase_keys[1:]:
            if is_far(filtered_phrases,ph,0.7,normalized_levenshtein ):
                filtered_phrases.append(ph)
            if len(filtered_phrases)>=max_phrases:
                break
    return filtered_phrases
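To show what the 0.7 threshold means, here is a standard-library sketch of normalized Levenshtein distance (edit distance divided by the length of the longer string) and a simplified version of the filtering loop. It does not use the `strsim` package, and `keep_distinct` is a hypothetical name; its handling of the size limit is slightly simpler than `filter_phrases`, so treat it as an approximation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length, so results fall in [0, 1]."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def keep_distinct(phrases, limit, threshold=0.7):
    """Keep a phrase only if it is at least `threshold` away from every phrase kept so far."""
    kept = []
    for phrase in phrases:
        if all(normalized_distance(phrase.lower(), k.lower()) >= threshold for k in kept):
            kept.append(phrase)
        if len(kept) >= limit:
            break
    return kept


print(keep_distinct(["heart", "hearts", "lungs"], limit=10))  # prints ['heart', 'lungs']
```

Here "hearts" is dropped because it is only one edit away from "heart" (distance 1/6), while "lungs" shares no letters with "heart" and has distance 1.0.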

In [16]:
# def get_nouns_multipartite(text):
#     out = []

#     extractor = pke.unsupervised.MultipartiteRank()
#     extractor.load_document(input=text, language='en')
#     pos = {'PROPN', 'NOUN'}
#     stoplist = list(string.punctuation)
#     stoplist += stopwords.words('english')
#     extractor.candidate_selection(pos=pos, stoplist=stoplist)
#     # 4. build the Multipartite graph and rank candidates using random walk,
#     #    alpha controls the weight adjustment mechanism, see TopicRank for
#     #    threshold/method parameters.
#     try:
#         extractor.candidate_weighting(alpha=1.1,
#                                       threshold=0.75,
#                                       method='average')
#     except:
#         return out

#     keyphrases = extractor.get_n_best(n=10)

#     for key in keyphrases:
#         out.append(key[0])

#     return out

def get_nouns_multipartite(text):
    out = []

    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(input=text, language='en')
    pos = {'PROPN', 'NOUN'}
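    # candidate_selection() is called without the 'stoplist' argument used in the commented-out version above
    extractor.candidate_selection(pos=pos)
    
    try:
        extractor.candidate_weighting(alpha=1.1, threshold=0.75, method='average')
    except Exception:
        # If weighting fails, return an empty keyword list
        return out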

    keyphrases = extractor.get_n_best(n=10)

    for key in keyphrases:
        out.append(key[0])

    return out


In [17]:
def get_phrases(doc):
    phrases={}
    for np in doc.noun_chunks:
        phrase =np.text
        len_phrase = len(phrase.split())
        if len_phrase > 1:
            if phrase not in phrases:
                phrases[phrase]=1
            else:
                phrases[phrase]=phrases[phrase]+1

    phrase_keys=list(phrases.keys())
    phrase_keys = sorted(phrase_keys, key= lambda x: len(x),reverse=True)
    phrase_keys=phrase_keys[:50]
    return phrase_keys

In [18]:

def get_keywords(nlp,text,max_keywords,s2v,fdist,normalized_levenshtein,no_of_sentences):
    doc = nlp(text)
    max_keywords = int(max_keywords)

    keywords = get_nouns_multipartite(text)
    keywords = sorted(keywords, key=lambda x: fdist[x])
    keywords = filter_phrases(keywords, max_keywords,normalized_levenshtein )

    phrase_keys = get_phrases(doc)
    filtered_phrases = filter_phrases(phrase_keys, max_keywords,normalized_levenshtein )

    total_phrases = keywords + filtered_phrases

    total_phrases_filtered = filter_phrases(total_phrases, min(max_keywords, 2*no_of_sentences),normalized_levenshtein )


    answers = []
    for answer in total_phrases_filtered:
        if answer not in answers and MCQs_available(answer,s2v):
            answers.append(answer)

    answers = answers[:max_keywords]
    return answers
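One detail in `get_keywords` is easy to miss: sorting by `fdist[x]` in ascending order puts the rarest words first, so uncommon (and usually more specific) terms are preferred as answers. The sketch below illustrates this with `collections.Counter`, which NLTK's `FreqDist` subclasses; the toy corpus is an assumption standing in for the Brown corpus.

```python
from collections import Counter

# Toy reference corpus (assumption); the notebook uses FreqDist(brown.words()).
reference_words = "the heart pumps blood and the heart has valves and blood vessels".split()
fdist = Counter(reference_words)

keywords = ["heart", "blood", "pericardium", "valves"]
# Words missing from the counter have frequency 0, so they sort to the front.
ranked = sorted(keywords, key=lambda w: fdist[w])
print(ranked)  # prints ['pericardium', 'valves', 'heart', 'blood']
```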

In [19]:

def generate_questions(keyword_sent_mapping,device,tokenizer,model,sense2vec,normalized_levenshtein):
    model.to(device)
    batch_text = []
    answers = keyword_sent_mapping.keys()
    for answer in answers:
        txt = keyword_sent_mapping[answer]
        context = "context: " + txt
        text = context + " " + "answer: " + answer + " </s>"
        batch_text.append(text)

    # print ("batch_text")
    # print (batch_text)
    # padding=True pads every sequence to the longest one in the batch (pad_to_max_length is deprecated)
    encoding = tokenizer.batch_encode_plus(batch_text, padding=True, return_tensors="pt")


    print ("Running model for generation")
    input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

    with torch.no_grad():
        outs = model.generate(input_ids=input_ids,
                              attention_mask=attention_masks,
                              max_length=150)

    output_array ={}
    output_array["questions"] =[]

    for index, val in enumerate(answers):
        individual_question ={}
        out = outs[index, :]
        dec = tokenizer.decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)

        Question = dec.replace("question:", "")
        Question = Question.strip()
        individual_question["question_statement"] = Question
        individual_question["question_type"] = "MCQ"
        individual_question["answer"] = val
        individual_question["id"] = index+1
        individual_question["options"], individual_question["options_algorithm"] = get_options(val, sense2vec)

        individual_question["options"] =  filter_phrases(individual_question["options"], 10,normalized_levenshtein)
        # Show the first three distractors and keep the rest as spares
        num_options = 3
        individual_question["extra_options"]= individual_question["options"][num_options:]
        individual_question["options"] = individual_question["options"][:num_options]
        individual_question["context"] = keyword_sent_mapping[val]
        # individual_question["options"]=[]
        # individual_question["options_algorithm"] = ""
        if len(individual_question["options"])>0:
            output_array["questions"].append(individual_question)

        print("Context: ", keyword_sent_mapping[val])
        print("Generated Question: ")
        print(Question)
        print("Answer: ", val)

    return output_array
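Two small pieces of `generate_questions` can be tried without loading the model: the input string format and the split between shown and spare options. The helper names `build_prompt` and `split_options` are hypothetical; the string layout and the cut-off of three options follow the function above, and the sample data comes from the respiratory-system example later in the notebook.

```python
def build_prompt(context, answer):
    """Format one input string the way generate_questions() does."""
    return "context: " + context + " " + "answer: " + answer + " </s>"


def split_options(candidates, n_shown=3):
    """The first `n_shown` distractors are shown; the rest are kept as spares."""
    return candidates[:n_shown], candidates[n_shown:]


context = "It includes your airways, lungs, and blood vessels."
print(build_prompt(context, "blood vessels"))

shown, spare = split_options(
    ["Capillaries", "Blood Flow", "Surrounding Tissue", "Smooth Muscle", "Vasculature", "Other Organs"]
)
print(shown)
print(spare)
```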

In [20]:
class PythonPredictor:
    def __init__(self):
        # Path to the sense2vec vectors on disk
        s2v_path = "/kaggle/input/s2v-old/s2v_old"
        
        print ("s2v model already exists.")
        self.tokenizer = T5Tokenizer.from_pretrained('t5-base')
        model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_squad_v1')
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model.to(device)
        # model.eval()
        self.device = device
        self.model = model
        self.nlp = spacy.load('en_core_web_sm')

        self.s2v = Sense2Vec().from_disk(s2v_path)

        self.fdist = FreqDist(brown.words())
        self.normalized_levenshtein = NormalizedLevenshtein()
        self.set_seed(42)
    def set_seed(self,seed):
        numpy.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    def predict(self, payload):
        start = time.time()
        inp = {
            "input_text": payload.get("input_text"),
            "max_questions": payload.get("max_questions", 4)
        }

        text = inp['input_text']
        sentences = tokenize_sentences(text)
        joiner = " "
        modified_text = joiner.join(sentences)


        keywords = get_keywords(self.nlp,modified_text,inp['max_questions'],self.s2v,self.fdist,self.normalized_levenshtein,len(sentences) )


        keyword_sentence_mapping = get_sentences_for_keyword(keywords, sentences)

        for k in keyword_sentence_mapping.keys():
            text_snippet = " ".join(keyword_sentence_mapping[k][:3])
            keyword_sentence_mapping[k] = text_snippet

   
        final_output = {}

        if len(keyword_sentence_mapping.keys()) == 0:
            # Return a dict on every path so callers can always index the result
            return final_output
        else:
            try:
                generated_questions = generate_questions(keyword_sentence_mapping,self.device,self.tokenizer,self.model,self.s2v,self.normalized_levenshtein)

            except Exception:
                return final_output
            end = time.time()

            final_output["statement"] = modified_text
            final_output["questions"] = generated_questions["questions"]
            final_output["time_taken"] = end-start

            return final_output
    

In [27]:
if __name__ == "__main__":

    mcq = PythonPredictor()

    payload={
    "input_text" : "A double-walled sac called the pericardium encases the heart, which serves to protect the heart and anchor it inside the chest. Between the outer layer, the parietal pericardium, and the inner layer, the serous pericardium, runs pericardial fluid, which lubricates the heart during contractions and movements of the lungs and diaphragm.The heart's outer wall consists of three layers. The outermost wall layer, or epicardium, is the inner wall of the pericardium. The middle layer, or myocardium, contains the muscle that contracts. The inner layer, or endocardium, is the lining that contacts the blood.The tricuspid valve and the mitral valve make up the atrioventricular (AV) valves, which connect the atria and the ventricles. The pulmonary semi-lunar valve separates the right ventricle from the pulmonary artery, and the aortic valve separates the left ventricle from the aorta. The heartstrings, or chordae tendinae, anchor the valves to heart muscles."
    }

    out= mcq.predict(payload)

s2v model already exists.


ImportError: 
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.


In [22]:
out["statement"]

NameError: name 'out' is not defined

In [26]:
out

{'statement': "A double-walled sac called the pericardium encases the heart, which serves to protect the heart and anchor it inside the chest. Between the outer layer, the parietal pericardium, and the inner layer, the serous pericardium, runs pericardial fluid, which lubricates the heart during contractions and movements of the lungs and diaphragm.The heart's outer wall consists of three layers. The outermost wall layer, or epicardium, is the inner wall of the pericardium. The middle layer, or myocardium, contains the muscle that contracts. The inner layer, or endocardium, is the lining that contacts the blood.The tricuspid valve and the mitral valve make up the atrioventricular (AV) valves, which connect the atria and the ventricles. The pulmonary semi-lunar valve separates the right ventricle from the pulmonary artery, and the aortic valve separates the left ventricle from the aorta. The heartstrings, or chordae tendinae, anchor the valves to heart muscles.",
 'questions': [{'questi

In [27]:
out["questions"][0]

{'question_statement': 'What is the double-walled sac that encases the heart?',
 'question_type': 'MCQ',
 'answer': 'pericardium',
 'id': 1,
 'options': ['Aorta', 'Abdominal Cavity', 'Fistula'],
 'options_algorithm': 'sense2vec',
 'extra_options': ['Hematoma',
  'Right Lung',
  'Blood Vessels',
  'Surrounding Tissue',
  'Peritoneum',
  'Spleen',
  'Abscess'],
 'context': "Between the outer layer, the parietal pericardium, and the inner layer, the serous pericardium, runs pericardial fluid, which lubricates the heart during contractions and movements of the lungs and diaphragm.The heart's outer wall consists of three layers. Between the outer layer, the parietal pericardium, and the inner layer, the serous pericardium, runs pericardial fluid, which lubricates the heart during contractions and movements of the lungs and diaphragm.The heart's outer wall consists of three layers. A double-walled sac called the pericardium encases the heart, which serves to protect the heart and anchor it i
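Note that `options` holds only distractors; the correct answer is stored separately under `answer`. A minimal sketch of turning one question dict into a displayable MCQ is shown below. The function `format_mcq` and its `seed` parameter are hypothetical additions, not part of the notebook, and the sample dict reuses the fields from the output above.

```python
import random


def format_mcq(question, seed=0):
    """Render one question dict as text, mixing the answer in with the distractors."""
    choices = [question["answer"].title()] + list(question["options"])
    random.Random(seed).shuffle(choices)  # fixed seed keeps the layout reproducible
    lines = [question["question_statement"]]
    for label, choice in zip("ABCD", choices):
        lines.append(f"  {label}. {choice}")
    return "\n".join(lines)


question = {
    "question_statement": "What is the double-walled sac that encases the heart?",
    "answer": "pericardium",
    "options": ["Aorta", "Abdominal Cavity", "Fistula"],
}
print(format_mcq(question))
```

If a distractor turns out to be unsuitable, an entry from `extra_options` could be swapped in before formatting.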

# Test On Different Texts

In [35]:
# test 1

if __name__ == "__main__":
    mcq = PythonPredictor()

    payload = {
        "input_text": "The respiratory system is a network of organs and tissues that help you breathe. It includes your airways, lungs, and blood vessels. The muscles that power your lungs are also part of the respiratory system. These parts work together to move oxygen throughout the body and clean out waste gases like carbon dioxide."
    }

    out = mcq.predict(payload)
out

s2v model already exists.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Sense2vec_distractors successful for word :  blood vessels
 Sense2vec_distractors successful for word :  airways
 Sense2vec_distractors successful for word :  tissues
 Sense2vec_distractors successful for word :  lungs


{'statement': 'The respiratory system is a network of organs and tissues that help you breathe. It includes your airways, lungs, and blood vessels. The muscles that power your lungs are also part of the respiratory system. These parts work together to move oxygen throughout the body and clean out waste gases like carbon dioxide.',
 'questions': [{'question_statement': 'Along with your airways, lungs and airways, what else is included in your body?',
   'question_type': 'MCQ',
   'answer': 'blood vessels',
   'id': 1,
   'options': ['Capillaries', 'Blood Flow', 'Surrounding Tissue'],
   'options_algorithm': 'sense2vec',
   'extra_options': ['Smooth Muscle', 'Vasculature', 'Other Organs'],
   'context': 'It includes your airways, lungs, and blood vessels.'},
  {'question_statement': 'Along with the lungs and blood vessels, what organ is a part of the body?',
   'question_type': 'MCQ',
   'answer': 'airways',
   'id': 2,
   'options': ['Nasal Passages', 'Sinuses', 'Respiratory System'],
 

In [37]:
# test 2
if __name__ == "__main__":
    mcq = PythonPredictor()

    payload = {
        "input_text": "Object-oriented programming (OOP) is a programming paradigm based on the concept of 'objects', which can contain data and code: data in the form of fields (often known as attributes or properties), and code, in the form of procedures (often known as methods). A feature of objects is that an object's own procedures can access and often modify the data fields of itself (objects have a notion of 'this' or 'self'). In OOP, computer programs are designed by making them out of objects that interact with one another. There is a significant diversity of OOP languages, but the most popular ones are class-based, meaning that objects are instances of classes, which also determine their types."
    }

    out = mcq.predict(payload)
out

s2v model already exists.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Sense2vec_distractors successful for word :  oop
 Sense2vec_distractors successful for word :  programming
 Sense2vec_distractors successful for word :  attributes


{'statement': "Object-oriented programming (OOP) is a programming paradigm based on the concept of 'objects', which can contain data and code: data in the form of fields (often known as attributes or properties), and code, in the form of procedures (often known as methods). A feature of objects is that an object's own procedures can access and often modify the data fields of itself (objects have a notion of 'this' or 'self'). In OOP, computer programs are designed by making them out of objects that interact with one another. There is a significant diversity of OOP languages, but the most popular ones are class-based, meaning that objects are instances of classes, which also determine their types.",
 'questions': [{'question_statement': 'What is the acronym for Object-oriented programming?',
   'question_type': 'MCQ',
   'answer': 'oop',
   'id': 1,
   'options': ['Functional Programming', 'Haskell', 'C++'],
   'options_algorithm': 'sense2vec',
   'extra_options': ['Object Orientation',