# Preprocess Quiz Dataset

This notebook preprocesses the quiz JSON file located at quiz-app/src/assets/translations/es.json in the repository https://github.com/microsoft/ML-For-Beginners. The quizzes are multiple-choice questions (MCQs) covering various topics. The goal is to extract the questions and answers into an evaluation set and a test set, where each example is a JSON object with "question" and "answer" fields.

We create two datasets: one with MCQ-style questions that include the options, built by directly extracting the question and answer keys for each topic, and another with open-ended questions that omit the options.

Some questions depend on their options for context, such as fill-in-the-blank questions or options like "both of these," so they must be rephrased until they can be answered without seeing the options. This could be done manually, but here we use a large language model (LLM) with structured output: the MCQs are passed as context and the model is instructed to convert them into open-ended question-and-answer pairs.

In [None]:
import getpass
import os
from dotenv import load_dotenv

load_dotenv()
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key: ")

### Load the Quiz file

In [None]:
import json

with open('../corpus/itml-quizzes.json', 'r') as f:
    quiz_obj = json.load(f)

### Extract the question and answer pairs from the JSON structure

In [None]:
dataset = []
lesson_quizzes: list = quiz_obj[0]["quizzes"]
for quiz_info in lesson_quizzes:
    title = quiz_info["title"]  # Lesson title (not used below)
    quizzes = quiz_info["quiz"]  # List of quiz questions for this lesson
    for quiz in quizzes:
        question_text = quiz["questionText"]
        # Take the text of the first option flagged as correct
        # (the flag is compared against the string "true")
        answer = next(
            item["answerText"] for item in quiz["answerOptions"] if item["isCorrect"] == "true"
        )

        mcq = "\n".join(["- " + item["answerText"] for item in quiz["answerOptions"]])
        question = f"Question:\n{question_text}\n\nOptions:\n{mcq}"

        dataset.append({"question": question, "answer": answer})
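To make the structure the extraction loop relies on easier to see, here is a minimal sketch that runs the same logic on a tiny in-memory object. The field names follow the cell above; the lesson title, the `"false"` flags on the wrong options, and the `extract_pairs` helper name are assumptions made for illustration, not taken from the real file.

```python
# Hypothetical miniature of the quiz JSON. Field names follow the extraction
# cell; the lesson title and the "false" flags are assumptions.
sample_quiz_obj = [
    {
        "quizzes": [
            {
                "title": "Introduction to ML",
                "quiz": [
                    {
                        "questionText": "What is the technical difference between classical ML and deep learning?",
                        "answerOptions": [
                            {"answerText": "classical ML was invented first", "isCorrect": "false"},
                            {"answerText": "the use of neural networks", "isCorrect": "true"},
                            {"answerText": "deep learning is used in robots", "isCorrect": "false"},
                        ],
                    }
                ],
            }
        ]
    }
]


def extract_pairs(quiz_obj):
    """Flatten every quiz question into a {"question", "answer"} record."""
    pairs = []
    for quiz_info in quiz_obj[0]["quizzes"]:
        for quiz in quiz_info["quiz"]:
            options = quiz["answerOptions"]
            # First option whose flag equals the string "true"
            answer = next(o["answerText"] for o in options if o["isCorrect"] == "true")
            mcq = "\n".join("- " + o["answerText"] for o in options)
            question = f"Question:\n{quiz['questionText']}\n\nOptions:\n{mcq}"
            pairs.append({"question": question, "answer": answer})
    return pairs


sample_pairs = extract_pairs(sample_quiz_obj)
print(sample_pairs[0]["question"])
print(f"\nAnswer:\n{sample_pairs[0]['answer']}")
```

Wrapping the loop in a function like this also makes it possible to test the extraction without the full quiz file.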

### Display a Sample Question and Answer Pair

In [None]:
print(dataset[1]["question"]) 
print(f'\nAnswer:\n{dataset[1]["answer"]}') 

Question:
What is the technical difference between classical ML and deep learning?

Options:
- classical ML was invented first
- the use of neural networks
- deep learning is used in robots

Answer:
the use of neural networks


### Generate Open-Ended Questions from MCQ-Formatted Questions Using an LLM

> This notebook uses Gemini models via the Instructor package to rephrase multiple-choice questions (MCQs). Running the following cells requires a `GOOGLE_API_KEY`. You can store this key in a `.env` file in the project root, enter it in the input box when prompted, or set it directly with `os.environ["GOOGLE_API_KEY"]`. To use a different model, modify the Instructor client instance and provide that platform's API key through the environment variables.

In [None]:
import getpass
import os
from dotenv import load_dotenv

load_dotenv()
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key: ")

# os.environ["GOOGLE_API_KEY"] = <GOOGLE_API_KEY>

In [None]:
import google.generativeai as genai
import instructor
from dotenv import load_dotenv
from limiter import Limiter
from pydantic import BaseModel

load_dotenv()

class Example(BaseModel):
    question: str
    answer: str

instructor_client = instructor.from_gemini(
    client=genai.GenerativeModel(model_name='gemini-2.0-flash'), mode=instructor.Mode.GEMINI_JSON, use_async=False
)

# These values are tuned to the free-tier requests-per-minute (RPM) limit of gemini-2.0-flash.
limiter = Limiter(rate=0.15, consume=1, capacity=1)

@limiter
def generate_open_qa(mcqs):
    # Meta prompting link :) https://chatgpt.com/share/67e40d4e-a764-8006-a436-768679e5fdcd
    # Join outside the f-string: a backslash inside an f-string expression
    # is a SyntaxError before Python 3.12.
    mcq_block = "\n".join(str(mcq) for mcq in mcqs)
    prompt = f"""Convert the following multiple-choice question (MCQ) and answer pairs into open-ended question and answer pairs. Ensure that:
    1. The rephrased question does not rely on the provided answer choices.
    2. The modified answer remains accurate and meaningful and aligns with the original answer.
    3. If the question includes options like "Both of the above" or "True/False," adjust the answer to be self-contained.
    4. If the question is a fill-in-the-blank type, rewrite it as a complete question.

    MCQS:
    {mcq_block}
    """
    return instructor_client.messages.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        response_model=list[Example],
    )

#### Generate 10 examples per request, as larger contexts may negatively impact output quality. 

> Note: This number is not derived from any specific experimentation.

In [None]:
from tqdm.notebook import tqdm
from time import sleep

# tqdm appends ": " to the description itself, so no trailing colon is needed
bar = tqdm(total=len(dataset), desc="Generating QA pairs")
qa_dataset = []
for i in range(0, len(dataset), 10):
    sleep(0.5)
    start, end = i, min(i + 10, len(dataset))
    qa_dataset += generate_open_qa(dataset[start:end])
    bar.update(end - start)
bar.close()

Generating QA pairs:   0%|          | 0/156 [00:00<?, ?it/s]
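The later split reuses the same row indices for the MCQ and open-ended datasets, so each batch needs to return exactly one rephrased pair per input MCQ. The sketch below shows one way to batch the requests and check that assumption. `iter_batches`, `convert_in_batches`, and the stand-in `fake_convert` are hypothetical helpers for illustration; in the notebook, `generate_open_qa` would take the place of `fake_convert`.

```python
def iter_batches(items, batch_size=10):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def convert_in_batches(items, convert_batch, batch_size=10):
    """Convert items batch by batch, failing fast if a batch changes size."""
    results = []
    for batch in iter_batches(items, batch_size):
        converted = list(convert_batch(batch))
        if len(converted) != len(batch):
            raise ValueError(
                f"expected {len(batch)} converted pairs, got {len(converted)}"
            )
        results.extend(converted)
    return results


def fake_convert(batch):
    """Stand-in for the LLM call: returns one record per input record."""
    return [{"question": f"open: {m['question']}", "answer": m["answer"]} for m in batch]


toy_mcqs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(23)]
batch_sizes = [len(batch) for batch in iter_batches(toy_mcqs)]
toy_converted = convert_in_batches(toy_mcqs, fake_convert)
print(batch_sizes, len(toy_converted))
```

Raising on a size mismatch is a deliberate choice in this sketch: a silently dropped or merged question would shift every later row and mispair questions across the two datasets.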

In [None]:
import pandas as pd
# Cast each item to a dict, since the examples are instances of the Example model
df_qa_dataset = pd.DataFrame([dict(qa) for qa in qa_dataset])
df_qa_dataset.head(5)

| | question | answer |
|---|---|---|
| 0 | Are machine learning applications prevalent in... | True |
| 1 | What is the primary technical distinction betw... | Deep learning utilizes neural networks, while ... |
| 2 | What are some reasons a business might impleme... | Businesses might use ML strategies to automate... |
| 3 | What is the primary concept that machine learn... | Machine learning algorithms are designed to si... |
| 4 | Could you provide an example of a classical ma... | An example of a classical machine learning tec... |


In [None]:
df_dataset = pd.DataFrame(dataset)
df_dataset.head(5)

| | question | answer |
|---|---|---|
| 0 | Question:\nApplications of machine learning ar... | True |
| 1 | Question:\nWhat is the technical difference be... | the use of neural networks |
| 2 | Question:\nWhy might a business want to use ML... | both of the above |
| 3 | Question:\nMachine learning algorithms are mea... | the human brain |
| 4 | Question:\nWhat is an example of a classical M... | natural language processing |


### Shuffle the data and create a train (evaluation) and test split of both datasets

In [None]:
random_seed = 42
# Setting frac=1 effectively shuffles the dataset
shuffled_indices = df_dataset.sample(frac=1, random_state=random_seed).index 

train_indices = shuffled_indices[:100]
test_indices = shuffled_indices[100:]

train_df = df_dataset.loc[train_indices]
test_df = df_dataset.loc[test_indices]

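# Reuse the same indices so each MCQ row stays paired with its rephrased QA row.
# This assumes df_qa_dataset has one row per MCQ, in the same order as df_dataset.
train_qa_df = df_qa_dataset.loc[train_indices]
test_qa_df = df_qa_dataset.loc[test_indices]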
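The split above shuffles one set of indices and applies it to both DataFrames. As a small illustration of why that keeps the two datasets paired, here is a sketch on toy data. The toy frames, their contents, and the 4/2 split size are made up for this example.

```python
import pandas as pd

# Toy stand-ins for df_dataset and df_qa_dataset: row i of each frame
# describes the same underlying question.
toy_mcq = pd.DataFrame({"question": [f"mcq {i}" for i in range(6)]})
toy_qa = pd.DataFrame({"question": [f"open {i}" for i in range(6)]})

# The pairing only holds if both frames have the same number of rows.
assert len(toy_mcq) == len(toy_qa)

# frac=1 samples every row, which amounts to a seeded shuffle of the index.
shuffled = toy_mcq.sample(frac=1, random_state=42).index
train_idx, test_idx = shuffled[:4], shuffled[4:]

toy_mcq_train, toy_qa_train = toy_mcq.loc[train_idx], toy_qa.loc[train_idx]
toy_mcq_test, toy_qa_test = toy_mcq.loc[test_idx], toy_qa.loc[test_idx]

# Each MCQ row and its open-ended counterpart land in the same split.
print(pd.concat([toy_mcq_train["question"], toy_qa_train["question"]], axis=1))
```

A length check like the assertion above may be worth running on the real DataFrames before splitting, since `.loc` raises a `KeyError` if the open-ended dataset is shorter than the MCQ dataset.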

### Display a Sample QA Pair Before and After Rephrasing

#### Before

In [None]:
print(train_df.iloc[5]['question'])
print("\nAnswer:\n", train_df.iloc[5]['answer'])

Question:
The process of splitting a dataset into a certain ratio of training and testing dataset using Scikit Learn's 'train_test_split()' method/function is called:

Options:
- Cross-Validation
- Hold-Out Validation
- Leave one out Validation

Answer:
 Hold-Out Validation


#### After

In [None]:
print("Question:\n",train_qa_df.iloc[5]['question'])
print("\nAnswer:\n", train_qa_df.iloc[5]['answer'])

Question:
 What is the name of the process of splitting a dataset into training and testing sets using Scikit Learn's 'train_test_split()' method/function?

Answer:
 This process is called Hold-Out Validation.


### Save the splits as JSONL files

In [None]:
train_df.to_json('../datasets/itml/itml_mcq_eval.jsonl', lines=True, orient='records')
test_df.to_json('../datasets/itml/itml_mcq_test.jsonl', lines=True, orient='records')
train_qa_df.to_json('../datasets/itml/itml_qa_eval.jsonl', lines=True, orient='records')
test_qa_df.to_json('../datasets/itml/itml_qa_test.jsonl', lines=True, orient='records')
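With `lines=True` and `orient='records'`, each row is written as one JSON object per line, and newlines inside a question are escaped so a record never spans multiple lines. The sketch below checks this round trip on a toy DataFrame written to a temporary directory. The toy rows and the file name `toy_eval.jsonl` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

toy_records = [
    {"question": "Question:\nWhich option is correct?\n\nOptions:\n- first\n- second", "answer": "second"},
    {"question": "What does ML stand for?", "answer": "machine learning"},
]
toy_df = pd.DataFrame(toy_records)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_eval.jsonl"  # hypothetical file name
    toy_df.to_json(path, lines=True, orient="records")

    # One line per record, even though the first question contains newlines.
    jsonl_lines = path.read_text(encoding="utf-8").splitlines()
    reloaded = [json.loads(line) for line in jsonl_lines]

print(len(jsonl_lines))
print(reloaded[0]["answer"])
```

Reading the file back with the standard `json` module keeps every value exactly as written, which makes the comparison with the original records straightforward.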

### Sample and save subsets of the training (evaluation) set for preliminary assessments

In [None]:
sample_sizes = [10, 30, 50]
for n in sample_sizes:
    sample_df = train_df.sample(n, random_state=random_seed)  # Seeded so the subsets are reproducible
    sample_qa_df = train_qa_df.loc[sample_df.index]
    sample_df.to_json(f'../datasets/itml/itml_mcq_eval_{n}_samples.jsonl', lines=True, orient='records')
    sample_qa_df.to_json(f'../datasets/itml/itml_qa_eval_{n}_samples.jsonl', lines=True, orient='records')