**Automated Question Generator Using NLP**

Manually generating questions from text is time-consuming and requires significant effort. This tool aims to automate the process, making it easier for anyone to get a clear overview of the main concepts in a text through question prompts. Whether for study or review, this question-generation tool helps users break down dense material into assessable chunks, providing a quicker and more interactive way to engage with content.

**1. Installing Necessary Libraries**

The command below installs the `transformers` library, which provides access to pre-trained language models for tasks such as question generation and text summarization.

The `[sentencepiece]` extra also installs **sentencepiece**, a subword tokenizer that splits text into smaller pieces the model can process. Some models, including T5, require it for tokenization.


In [1]:
!pip install transformers[sentencepiece]



**pymupdf**: Extracts text from PDF files

**python-docx**: Extracts text from DOCX files

**transformers**: Provides the language models used to generate questions

In [2]:
!pip install pymupdf
!pip install python-docx
!pip install transformers

Collecting pymupdf
  Downloading PyMuPDF-1.24.13-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.13-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.8/19.8 MB 33.6 MB/s eta 0:00:00
Installing collected packages: pymupdf
Successfully installed pymupdf-1.24.13
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 244.3/244.3 kB 4.6 MB/s eta 0:00:00
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2


In [10]:
import fitz  # PyMuPDF: open PDF files and extract their text
import docx  # python-docx: read paragraphs from DOCX files
import spacy  # NLP library used here to split text into sentences
from sklearn.feature_extraction.text import TfidfVectorizer  # weights terms by TF-IDF for keyword extraction
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer  # load the pre-trained seq2seq model, its tokenizer, and the generation pipeline
import random  # randomly sample sentences so each question gets a different context
import re  # regular expressions for text cleaning and pattern matching

In [16]:
import spacy

# Load the SpaCy language model
nlp = spacy.load("en_core_web_sm")

**2. Setting Up Question Generation Model**

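Model: Uses the `valhalla/t5-base-qg-hl` checkpoint, a T5 model for question generation.

Tokenizer: Tokenizes and encodes text for model input.

Model Loading: Loads the checkpoint as a sequence-to-sequence model that maps input text to generated text.

Pipeline: Wraps the model and tokenizer in a `text2text-generation` pipeline so a question can be generated with a single call.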

In [4]:
model_name = "valhalla/t5-base-qg-hl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question_generator = pipeline(
    "text2text-generation",  # Use "text2text-generation" for question generation
    model=model,
    tokenizer=tokenizer
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



**3. Extracting Text from Files**

Purpose: Extracts text from PDF, DOCX, or TXT files.

Text Initialization: Creates an empty string to store extracted text.

File Type Check: Identifies file type based on the extension.

PDF: Reads each page’s text and appends it.

DOCX: Reads each paragraph and appends with newlines.

TXT: Reads the entire file in one step.

Return: Outputs the extracted text.


In [37]:
def extract_text(file_path):
    """Extract text from file (pdf, docx, txt) based on file extension."""
    text = ""
    # Compare against a lowercased path so ".PDF" or ".Docx" are also recognized
    lowered = file_path.lower()
    if lowered.endswith(".pdf"):
        # The context manager closes the PDF once all pages have been read
        with fitz.open(file_path) as pdf_doc:
            for page in pdf_doc:
                text += page.get_text()
    elif lowered.endswith(".docx"):
        doc = docx.Document(file_path)
        for paragraph in doc.paragraphs:
            text += paragraph.text + "\n"
    elif lowered.endswith(".txt"):
        # Try to read with 'utf-8' encoding, fall back to 'latin-1' if it fails
        try:
            with open(file_path, "r", encoding="utf-8") as file:
                text = file.read()
        except UnicodeDecodeError:
            with open(file_path, "r", encoding="latin-1") as file:
                text = file.read()
    else:
        # Fail early instead of returning an empty string for unknown file types
        raise ValueError(f"Unsupported file type: {file_path}")
    return text
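
The UTF-8 to Latin-1 fallback used for `.txt` files can be checked in isolation. The following is a minimal sketch using only the standard library; `read_text_with_fallback` and `demo` are hypothetical helper names introduced for illustration and are not part of the notebook.

```python
import os
import tempfile


def read_text_with_fallback(path):
    """Read a text file as UTF-8, falling back to Latin-1 on decode errors."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return fh.read()
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this second read cannot
        # raise a decode error, although text in other encodings may come out garbled.
        with open(path, "r", encoding="latin-1") as fh:
            return fh.read()


def demo():
    """Write a Latin-1 encoded file and read it back through the fallback."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "wb") as fh:
            # The byte 0xE9 ("é" in Latin-1) followed by a space is invalid UTF-8
            fh.write("café au lait".encode("latin-1"))
        return read_text_with_fallback(path)


print(demo())
```

Since Latin-1 never fails to decode, the fallback trades possible mis-decoded characters for a guarantee that some text is returned.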

**4. Text Preprocessing**

* The **preprocess_text** function prepares raw text for question generation. It splits the text into sentences with spaCy, strips surrounding whitespace, and drops fragments of 10 characters or fewer, since very short fragments rarely carry enough context for a useful question.


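* **tfidf_keyword_extraction Function**

  Purpose: Extracts keywords from a list of sentences using TF-IDF.

  TF-IDF Vectorizer: Initializes a `TfidfVectorizer` whose vocabulary is capped at `num_keywords` terms through `max_features`.

  Fit Model: Fits the vectorizer to the given sentences to build the vocabulary and compute IDF weights.

  Return Keywords: Returns the retained vocabulary. Because `max_features` keeps the terms that occur most often across the corpus and no stopword list is set, common words such as "the" and "of" can dominate the result.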

* **generate_questions**

  Purpose: Generates questions from text.

  Splitting: Breaks text into sentences.

  Loop: For each question:

  Context: Randomly selects up to three sentences.

  Highlighting: Adds `<hl>` tags for focus.

  Generate: Creates a question using `question_generator`.

  Error Handling: Logs issues and adds fallback text if needed.

  Return: Outputs generated questions list.
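
To make the TF-IDF keyword step concrete, here is a minimal pure-Python sketch that ranks terms by their summed TF-IDF weight after removing a small stopword set. It is an illustration under stated assumptions, not the notebook's method: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, whereas this sketch uses a plain `log(N / df) + 1`. The names `STOPWORDS`, `tokenize`, and `rank_keywords` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; real lists are much longer)
STOPWORDS = {"the", "is", "of", "to", "and", "in", "a", "an", "for", "on"}


def tokenize(sentence):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOPWORDS]


def rank_keywords(sentences, num_keywords=5):
    """Return the terms with the highest summed TF-IDF score."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)

    # Document frequency: in how many sentences each term appears
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in docs:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                       # relative term frequency
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # rarer terms weigh more
            scores[term] += tf * idf

    # Highest score first; ties are broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:num_keywords]]


sentences = [
    "PSNR measures the quality of an image.",
    "SSIM compares the structure of an image.",
    "The loss is small.",
]
print(rank_keywords(sentences, num_keywords=3))
```

Ranking by score rather than by raw frequency, together with stopword removal, keeps function words out of the keyword list.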


In [38]:
def preprocess_text(text):
    """Tokenize and clean text using SpaCy."""
    # nlp object is now available as it's loaded globally
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]
    return sentences

def tfidf_keyword_extraction(sentences, num_keywords=5):
    """Extract keywords using TF-IDF."""
    tfidf = TfidfVectorizer(max_features=num_keywords)
    tfidf.fit(sentences)
    return tfidf.get_feature_names_out()

def generate_questions(text, num_questions):
    """Generate questions based on extracted text with varied highlighted sentences."""
    # Split on periods and drop empty fragments (such as the one after the final period)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    questions = []
    for _ in range(num_questions):
        # Randomly select up to three sentences as context for each question
        selected_sentences = ". ".join(random.sample(sentences, min(3, len(sentences))))
        # The model card for this checkpoint delimits the highlighted span with
        # <hl> on both sides, so the closing marker is <hl> rather than </hl>
        highlighted_text = f"<hl> {selected_sentences}. <hl>"

        # Generate question with adjusted length
        try:
            output = question_generator(highlighted_text, max_new_tokens=50)
            # Retrieve question text from 'generated_text' instead of 'question'
            if output and "generated_text" in output[0]:
                question = output[0]["generated_text"]
            else:
                print("Unexpected output format:", output)  # Debugging line
                question = "Could not generate question for this context."
            questions.append(question)
        except Exception as e:
            print(f"Error generating question: {e}")
            questions.append("Error occurred during question generation.")
    return questions
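
The highlighting step can be sketched without loading the model. The example below assumes the highlight convention shown on the `valhalla/t5-base-qg-hl` model card, where the span of interest is wrapped in `<hl>` tokens on both sides and left inside its surrounding context. `split_sentences` and `build_highlight_input` are hypothetical helpers, and the regex splitter is only a rough stand-in for spaCy's sentence segmentation.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace; drop empty fragments."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_highlight_input(sentences, index, window=1):
    """Wrap sentence `index` in <hl> tokens and keep `window` neighbors as context."""
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    pieces = []
    for i in range(start, end):
        if i == index:
            pieces.append(f"<hl> {sentences[i]} <hl>")  # highlighted span
        else:
            pieces.append(sentences[i])                  # plain context
    return " ".join(pieces)


sentences = split_sentences("RNNs process sequences. LSTMs add gates. GRUs merge gates.")
prompt = build_highlight_input(sentences, index=1)
print(prompt)
```

Keeping neighboring sentences in their original order, rather than sampling three unrelated ones, gives the model a coherent passage around the highlighted span. The resulting string could then be passed to `question_generator` in place of `highlighted_text`.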




**5. Main Function**

**User Input:** Asks for file path and number of questions.

**Text Extraction:** Extracts text from the given file.

**Text Preprocessing:** Processes the text into sentences.

**Keyword Extraction:** Optionally extracts and prints keywords.

**Question Generation:** Generates and prints the specified number of questions.

**Execution:** Runs the `main` function when the script is executed.


In [21]:
def main():
    # Get user input for file path and number of questions
    file_path = input("Enter the file path (pdf, docx, or txt): ")
    num_questions = int(input("Enter the number of questions to generate: "))

    # Extract and preprocess text
    text = extract_text(file_path)
    sentences = preprocess_text(text)

    # Optional: Extract keywords (for analysis or enhancement)
    keywords = tfidf_keyword_extraction(sentences)
    print("Extracted Keywords:", keywords)


    questions = generate_questions(" ".join(sentences), num_questions)

    # Display generated questions
    print("\nGenerated Questions:")
    for i, question in enumerate(questions, 1):
        print(f"{i}. {question}")

if __name__ == "__main__":
    main()

Enter the file path (pdf, docx, or txt): Write-up for image metrics and loss functions.docx
Enter the number of questions to generate: 20
Extracted Keywords: ['image' 'is' 'of' 'the' 'to']

Generated Questions:
1. What is the RRIQA Output?
2. What are BRISQUE's key advantages?
3. What are FID's key advantages?
4. What is the output of the texture loss calculation?
5. What is the definition of a high RRIQA value?
6. What is the difficulty in FSIM?
7. What is BRISQUE used for?
8. What is BRISQUE's main purpose?
9. What is the formula for PSNR?
10. What is the main limitation of IFC?
11. What is the most common use of edge preservation?
12. What is the output of the BRISQUE calculation?
13. What is the use of L1 loss in image processing?
14. What is the significance of PIPE?
15. What is the typical approach to computing edge loss?
16. What is the meaning of BRISQUE?
17. What is the requirement for PIPE to be significant?
18. What is the main point of RRIQA?
19. What is the typical interpr

In [33]:
# main() is already defined above; call it again for a second document
main()

Enter the file path (pdf, docx, or txt): RNN,LSTM,GRU.pdf
Enter the number of questions to generate: 20
Extracted Keywords: ['and' 'of' 'output' 'the' 'to']

Generated Questions:
1. What means “remember everything” and activation output of 0?
2. What is the basic structure of RNN?
3. What three components are in the RNN structure?
4. How do LSTMs work?
5. What is the name of the input gate?
6. What output is then given out?
7. What does 1 0 mean?
8. What gate is used to generate a scaling fraction?
9. What is the final hidden state?
10. What is the name of the input gate?
11. What is the difference between update and reset gates?
12. What is the output gate?
13. What gate determines which information to discard?
14. What is the basic structure of RNN?
15. How many to many is a simple neural network used for?
16. What three components of RNN are used?
17. What is the function that determines which information is about to enter state of lstm?
18. What is the hidden layer?
19. What is the

In [40]:
# main() is already defined above; call it again for a third document
main()

Enter the file path (pdf, docx, or txt): NLP.txt
Enter the number of questions to generate: 10
Extracted Keywords: ['and' 'in' 'nlp' 'of' 'to']

Generated Questions:
1. What is the process of dividing text into smaller units?
2. What is the name of the new technology that allows machines to understand human speech?
3. What is the term for the process of dividing text into smaller units?
4. What is the term for a tool that monitors social media for public opinion or customer feedback?
5. What is the process of splitting text into smaller units called?
6. What is the verb used to describe a sentence?
7. What is the term for the process of condensing a large body of text into a shorter, more concise summary?
8. What is the main use of NLP?
9. How many units does the process of splitting text into smaller units?
10. How many models have significantly improved the accuracy and capabilities of NLP systems?
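
Because each context is sampled at random, the runs above contain repeated questions (for example, "What is the name of the input gate?" and "What is the basic structure of RNN?" each appear twice in the second run). A minimal post-processing sketch, assuming exact matches after case and whitespace normalization are the only duplicates worth removing; `dedupe_questions` is a hypothetical helper, not part of the notebook.

```python
def dedupe_questions(questions):
    """Remove repeated questions while preserving the original order."""
    seen = set()
    unique = []
    for question in questions:
        # Normalize case and collapse whitespace so trivial variants match
        key = " ".join(question.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(question)
    return unique


generated = [
    "What is the basic structure of RNN?",
    "What is the name of the input gate?",
    "What is the name of the input gate?",
    "what is the basic structure of  RNN?",
]
print(dedupe_questions(generated))
```

Near-duplicates with different wording, such as the several "dividing text into smaller units" questions in the last run, would need a similarity measure rather than exact matching.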
