Question generation can be divided into four stages. Let's list them first to get a high-level overview of how the entire pipeline works.


1) Extractive Text Summarization

2) Keyword Extraction

3) Distractor Generation

4) Question Generation


# 1) Extractive Text Summarization


For extractive summarization, several techniques are available, such as TextRank, sentence scoring based on word frequency, and the BERT extractive summarizer. For our use case, we use the BERT extractive summarizer.



In this method, we start by converting sentences into embeddings, which are numerical representations that capture their semantic meaning. Next, a clustering algorithm groups these embeddings so that sentences with similar meanings fall into the same cluster. The sentence closest to each cluster's centroid is taken as that cluster's representative, since it best captures what the cluster is about. A coreference resolution step can additionally resolve ambiguous pronouns or references, so that the selected sentences remain coherent when read out of their original context. Once clustering is complete, the number of representative sentences is configurable, which lets us control how concise the summary is.
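The embed, cluster, and pick-nearest-to-centroid idea can be sketched in plain Python. This is only an illustration under stated assumptions: the two-dimensional "embeddings" are hand-made toy values standing in for real BERT sentence vectors, and `kmeans` and `pick_representatives` are hypothetical helper names written out by hand, not the internals of `bert-extractive-summarizer`.

```python
import math

def mean(points):
    # Component-wise average of a list of equal-length vectors.
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(points, k, iterations=10):
    # Deterministic start: use k evenly spaced points as initial centroids.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(idx)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [
            mean([points[i] for i in members]) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids, clusters

def pick_representatives(sentences, embeddings, k):
    # One sentence per cluster: the member closest to the cluster centroid.
    centroids, clusters = kmeans(embeddings, k)
    chosen = []
    for centroid, members in zip(centroids, clusters):
        if members:
            chosen.append(min(members, key=lambda i: math.dist(embeddings[i], centroid)))
    # Restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(chosen)]

sentences = [
    "Steam power was a key innovation.",
    "Steam engines and mechanized looms filled the new factories.",
    "Textile production was mechanized.",
    "People left rural areas.",
    "Workers moved from the countryside to growing cities.",
    "Urbanization changed daily life.",
]
# Toy 2-D vectors (made up for illustration): the first three sentences sit
# near one another, as do the last three.
embeddings = [
    [0.9, 0.1], [0.95, 0.2], [0.7, 0.3],
    [0.1, 0.9], [0.2, 0.85], [0.4, 1.0],
]

summary = pick_representatives(sentences, embeddings, k=2)
print(summary)
```

With real embeddings the same selection step applies; only the vectors and the number of clusters change.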

In [2]:
# Installations
# !pip install bert-extractive-summarizer
# !pip install python-rake==1.4.4
# !pip install sense2vec


In [3]:
# Import the BERT extractive summarizer
from summarizer import Summarizer

text = "The Industrial Revolution was a period of profound change that transformed societies from predominantly agrarian to industrial. It began in Britain in the late 18th century and spread across Europe and North America during the 19th century. Key innovations during this period included the development of steam power, the mechanization of textile production, and the rise of factories. These changes led to urbanization as people moved from rural areas to cities in search of work. The Industrial Revolution had far-reaching social, economic, and environmental impacts, shaping the modern world."
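# Loading the model downloads the pre-trained BERT weights on first use.
model = Summarizer()
# min_length / max_length bound the length of candidate sentences;
# ratio is the fraction of sentences to keep in the summary.
result = model(text, min_length=60, max_length=500, ratio=0.4)
summarized_text = ''.join(result)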



# 2) Keyword Extraction



With the summary from the previous stage in hand, the next step is keyword extraction: identifying the significant keywords and key phrases in the summarized content. Several techniques exist for this purpose, including TF-IDF, YAKE (Yet Another Keyword Extractor), and RAKE (Rapid Automatic Keyword Extraction). TF-IDF and YAKE score terms from statistical features of the text, whereas RAKE splits the text into candidate phrases at stopwords and punctuation and scores them using a word co-occurrence graph. We use RAKE here because it is fast, requires only a stopword list, and returns multi-word phrases that work well as answers.
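To make the scoring concrete, here is a minimal sketch of the RAKE idea in plain Python. It is a simplified illustration, not the `python-rake` implementation: the stop list is a tiny hand-picked assumption (the notebook uses SmartStoplist.txt), and `candidate_phrases` and `rake_scores` are hypothetical names. Each word is scored as degree divided by frequency, where degree counts the words it co-occurs with inside candidate phrases (itself included), and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption, far smaller than a real one).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "for", "from", "on",
             "has", "is", "it", "its", "their", "this", "your"}

def candidate_phrases(text, stopwords):
    phrases = []
    # Punctuation ends a phrase, so split into fragments first.
    for fragment in re.split(r"[.,;:!?()\"]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9']+", fragment):
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text, stopwords=STOPWORDS):
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Every word in the phrase (including itself) adds to the degree.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

scores = rake_scores(
    "Google has introduced a fresh method for Android users to access podcasts."
)
for phrase, score in scores:
    print(phrase, score)
```

Under this scheme longer phrases made of rarely repeated words score highest, which is why multi-word phrases tend to rank above single words.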

In [4]:
import RAKE

# RAKE setup: pass the path to a stopword file. The SMART stop list is available at
# https://raw.githubusercontent.com/zelandiya/RAKE-tutorial/master/data/stoplists/SmartStoplist.txt
stop_dir = "/kaggle/input/smart-english-stoplist/SmartStoplist.txt"
rake_object = RAKE.Rake(stop_dir)
# Sample text to test RAKE
text = "Google has silently introduced a fresh method for Android users to access podcasts and subscribe to their favorite shows, and it's already operational on your device. The news broke through an exclusive report from podcast production company Pacific Content, shedding light on this development directly from Google."
# Extract keywords
keywords = rake_object.run(text)
print("keywords: ", keywords)

keywords:  [('podcast production company pacific content', 25.0), ('silently introduced', 4.0), ('fresh method', 4.0), ('android users', 4.0), ('access podcasts', 4.0), ('favorite shows', 4.0), ('news broke', 4.0), ('exclusive report', 4.0), ('shedding light', 4.0), ('development directly', 4.0), ('google', 1.0), ('subscribe', 1.0), ('operational', 1.0), ('device', 1.0)]


# 3) Distractor Generation


With the keywords extracted, the next task is to generate incorrect choices, known as distractors. Distractors must fit the context of the correct answer, which is difficult because the same word can have several meanings. Take "Amazon," for instance, which could refer to either the tech giant or the vast rainforest.

To tackle this, lexical databases such as WordNet link each word to its definition, with one entry per meaning when a word has several. For our use case, however, we use Sense2Vec, a variant of Word2Vec that learns a separate vector for each sense of a word or phrase (for example, its part of speech or entity type), so that similar terms can be looked up for one specific sense.
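The benefit of sense-specific vectors can be shown with a toy example. Everything in this sketch is an assumption made for illustration: the three-dimensional vectors are made up and do not come from a trained model, and `nearest` is a hypothetical helper. Only the `phrase|SENSE` key format mirrors how Sense2Vec labels its entries.

```python
import math

# Toy vectors keyed by "phrase|SENSE". The numbers are invented purely to
# illustrate the lookup; they are not real Sense2Vec embeddings.
TOY_VECTORS = {
    "amazon|ORG": [0.9, 0.1, 0.0],
    "microsoft|ORG": [0.8, 0.2, 0.1],
    "google|ORG": [0.85, 0.15, 0.05],
    "amazon|LOC": [0.0, 0.2, 0.9],
    "congo_basin|LOC": [0.1, 0.1, 0.95],
    "borneo|LOC": [0.05, 0.3, 0.85],
}

def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(key, vectors, n=2):
    # Rank every other entry by similarity to the query key.
    query = vectors[key]
    others = [(k, cosine(query, v)) for k, v in vectors.items() if k != key]
    others.sort(key=lambda kv: kv[1], reverse=True)
    return [k for k, _ in others[:n]]

org_neighbours = nearest("amazon|ORG", TOY_VECTORS)
loc_neighbours = nearest("amazon|LOC", TOY_VECTORS)
print("amazon|ORG ->", org_neighbours)
print("amazon|LOC ->", loc_neighbours)
```

Because the company and the rainforest have separate keys, each one gets neighbours from its own sense, which is what makes the resulting distractors plausible.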

In [5]:
from sense2vec import Sense2Vec
from collections import OrderedDict

s2v = Sense2Vec().from_disk("/kaggle/input/s2v-old/s2v_reddit_2015_md/s2v_old")

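def sense2vec_get_words(word, s2v):
    """Return distractor candidates for `word`, de-duplicated in order."""
    normalized = word.lower()
    # sense2vec keys join multi-word phrases with underscores.
    sense = s2v.get_best_sense(normalized.replace(" ", "_"))
    if sense is None:
        # The phrase is not in the vocabulary, so there is nothing to suggest.
        return []

    output = []
    for key, _score in s2v.most_similar(sense, n=20):
        # Keys look like "machine_learning|NOUN"; keep only the phrase part.
        candidate = key.split("|")[0].replace("_", " ").lower()
        # Compare against the space-separated form so the answer is excluded.
        if candidate != normalized:
            output.append(candidate.title())

    # OrderedDict.fromkeys drops duplicates while keeping the original order.
    return list(OrderedDict.fromkeys(output))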

word = "Natural Language Processing"
distractors = sense2vec_get_words(word, s2v)

print("Distractors for", word, ": ", distractors)


Distractors for Natural Language Processing :  ['Machine Learning', 'Computer Vision', 'Deep Learning', 'Data Analysis', 'Neural Nets', 'Relational Databases', 'Algorithms', 'Neural Networks', 'Data Processing', 'Image Recognition', 'Nlp', 'Big Data', 'Data Science', 'Big Data Analysis', 'Information Retrieval', 'Speech Recognition', 'Programming Languages']


# 4) Question Generation Using the T5 Model

Let's recap the progress we've made thus far. Initially, we condensed the lengthy input text into a concise summary. Next, we extracted keywords from this summary, which will serve as our correct answers. Following this, we crafted plausible incorrect choices that are contextually relevant to the correct answers.

For the final step, we generate questions from the context and the answers. To do this, we use a pre-trained transformer model that has been fine-tuned on the SQuAD dataset. Specifically, we use T5, an encoder-decoder transformer that treats every task as a text-to-text problem. A full explanation of T5 is beyond the scope of this discussion, so we go straight to the implementation.

In [6]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# T5 checkpoint fine-tuned on SQuAD for question generation,
# paired with the stock t5-base tokenizer.
question_model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_squad_v1')
question_tokenizer = T5Tokenizer.from_pretrained('t5-base')
question_model = question_model.to(device)



In [7]:

def get_question(context, answer, model, tokenizer):
    # The fine-tuned model expects its input as "context: ... answer: ...".
    text = "context: {} answer: {}".format(context, answer)
    # Truncate long inputs to 384 tokens and move the tensors to the model's device.
    encoding = tokenizer(text, max_length=384, truncation=True, return_tensors="pt").to(device)
    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

    # Beam search with 5 beams; only the best sequence is returned.
    outs = model.generate(input_ids=input_ids,
                          attention_mask=attention_mask,
                          early_stopping=True,
                          num_beams=5,
                          num_return_sequences=1,
                          no_repeat_ngram_size=2,
                          max_length=72)

    dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]

    # The model prefixes its output with "question:"; strip that label.
    question = dec[0].replace("question:", "").strip()
    return question

In [8]:
def generate_mcq(context, answer, distractors):
    # Generate the question
    question = get_question(context, answer, question_model, question_tokenizer)

    # Create the MCQ
    mcq = {
        "question": question,
        "options": [answer] + distractors
    }

    return mcq

# Example usage:
context = "Google has silently introduced a fresh method for Android users to access podcasts and subscribe to their favorite shows, and it's already operational on your device. The news broke through an exclusive report from podcast production company Pacific Content, shedding light on this development directly from Google."
answer = "Google"
distractors = ["Amazon", "Microsoft", "Apple"]

mcq = generate_mcq(context, answer, distractors)
print("MCQ:")
print("Question:", mcq["question"])
print("Options:", mcq["options"])


MCQ:
Question: Who has quietly introduced a new method for Android users to access podcasts?
Options: ['Google', 'Amazon', 'Microsoft', 'Apple']
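Note that `generate_mcq` always lists the correct answer first. Before showing a question to a learner, the options would normally be shuffled while keeping track of where the answer ends up. The helper below is one possible sketch; `shuffle_options`, its `seed` parameter, and the `answer_index` field are hypothetical additions and not part of the notebook.

```python
import random

def shuffle_options(answer, distractors, seed=None):
    # Combine the answer with its distractors and shuffle them together.
    options = [answer] + list(distractors)
    # A dedicated Random instance keeps the shuffle reproducible when a seed
    # is given, without touching the global random state.
    rng = random.Random(seed)
    rng.shuffle(options)
    # Record the answer's new position so the MCQ can still be graded.
    return {"options": options, "answer_index": options.index(answer)}

shuffled = shuffle_options("Google", ["Amazon", "Microsoft", "Apple"], seed=0)
print("Options:", shuffled["options"])
print("Correct option index:", shuffled["answer_index"])
```

Storing the index rather than relying on position keeps grading correct however the options are ordered.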
