# **Task - Question Answering**

# Introduction
This notebook evaluates a Large Language Model (LLM) on the Massive Multitask Language Understanding (MMLU) benchmark, which tests the model on multiple-choice questions across a variety of subjects, including STEM, humanities, and social sciences.

We will follow these steps:
1. Installing the required libraries.
2. Loading the LLM and setting up the API.
3. Selecting and evaluating a subject from the MMLU dataset.
4. Calculating the model's accuracy.

For more information about the benchmark, see the MMLU paper: https://arxiv.org/pdf/2009.03300


# Step 1: Install Pre-requisites

We need to install the following libraries:
- `openai`: For interacting with the OpenAI API to query the LLM.
- `python-dotenv`: For loading the API key from an environment file instead of hard-coding it.
- `datasets`: For downloading the MMLU dataset from the Hugging Face Hub.
- `tqdm`: For adding progress bars to loops, making it easier to monitor progress.
- `pandas` and `matplotlib`: For tabulating and plotting the evaluation results.


In [None]:
%pip install openai==0.28 python-dotenv datasets tqdm pandas matplotlib

In [None]:
import os

import matplotlib.pyplot as plt
import openai
import pandas as pd
from datasets import load_dataset
from dotenv import load_dotenv
from tqdm import tqdm

# Step 2: Load the LLM

We will query the GPT-3.5 Turbo model through the OpenAI API. The API key is read from an environment file so that it is not hard-coded in the notebook.

The **MMLU benchmark** comprises multiple-choice questions, and the model must reply with the letter of the correct answer: A, B, C, or D.

**To keep the model from returning anything beyond the answer letter (A, B, C, or D), we set the *max_tokens* parameter to 1, which limits the response to a single token.**

In [None]:
# Load API key from environment file
load_dotenv(dotenv_path="../apikey.env.txt")  # replace the "file path" with the location of your API key file

# Set API key for OpenAI
APIKEY = os.getenv("APIKEY")
openai.api_key = APIKEY
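
If the environment file is missing or the variable name differs, `os.getenv` returns `None`, and the problem only surfaces at the first API call. A minimal sketch of an early check, assuming the key is stored under the name `APIKEY` as above; `require_env` is a hypothetical helper, not part of `python-dotenv` or `openai`:

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "check the path passed to load_dotenv()."
        )
    return value
```

With this helper, the assignment above could be written as `openai.api_key = require_env("APIKEY")`, so a missing key fails immediately with a readable message.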

In [None]:
# Function to get LLM response
def GetModelResponse(system_content, user_content):
    system = {'role': 'system', 'content': system_content}
    user = {'role': 'user', 'content': user_content}

    # Request a one-token response from GPT-3.5 Turbo (max_tokens=1)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[system, user],
        max_tokens=1  # limit the reply to a single token, ideally the answer letter
    )

    # Extract and return the model's response
    content = response.choices[0].message.content
    return content
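
A single token is not guaranteed to be one of the four letters: the model may return a lowercase letter, surrounding whitespace, or an unrelated token. The following is a minimal sketch of a normalization step, assuming that any response other than A, B, C, or D should be treated as unanswered. `normalize_answer` and `VALID_LETTERS` are hypothetical names, not part of the notebook or the OpenAI library.

```python
VALID_LETTERS = ("A", "B", "C", "D")

def normalize_answer(raw):
    """Map a raw model response to 'A'-'D', or None if it is not a valid choice."""
    if raw is None:
        return None
    cleaned = raw.strip().upper()
    return cleaned if cleaned in VALID_LETTERS else None

print(normalize_answer(" b"))   # B
print(normalize_answer("The"))  # None
```

Returning `None` for invalid responses makes it possible to count unparseable answers separately from wrong ones.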

# Step 3: Load Test Dataset
**The MMLU benchmark comprises a Q&A test set covering 57 subjects. The next cell lists all 57 subjects and asks you to select one to test the LLM on.**

In [None]:
# List of subjects in the MMLU dataset
subjects = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics", "clinical_knowledge",
    "college_biology", "college_chemistry", "college_computer_science", "college_mathematics",
    "college_medicine", "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics", "formal_logic",
    "global_facts", "high_school_biology", "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography", "high_school_government_and_politics",
    "high_school_macroeconomics", "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics", "high_school_us_history",
    "high_school_world_history", "human_aging", "human_sexuality", "international_law",
    "jurisprudence", "logical_fallacies", "machine_learning", "management", "marketing",
    "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios", "nutrition",
    "philosophy", "prehistory", "professional_accounting", "professional_law", "professional_medicine",
    "professional_psychology", "public_relations", "security_studies", "sociology", "us_foreign_policy",
    "virology", "world_religions"
]

print("Available MMLU subjects:")
for i, subject in enumerate(subjects, 1):
    print(f"{i}. {subject}")

# Ask the user to choose a subject
while True:
    try:
        choice = int(input("\nPlease choose a subject number from the list above: "))
        if 1 <= choice <= len(subjects):
            chosen_subject = subjects[choice - 1]
            break
        else:
            print("Invalid choice. Please try again.")
    except ValueError:
        print("Please enter a valid number.")
print(f"\033[34mYou have chosen: {chosen_subject}\033[0m")

Next, we load the test split for the selected subject.

In [None]:
# Function to load the chosen MMLU dataset
def load_mmlu_dataset(subject):
    print(f"\nLoading the {subject} dataset...")
    dataset = load_dataset("cais/mmlu", subject, split="test")
    print("Dataset loaded successfully.")
    return dataset

# Load the chosen dataset
test_dataset = load_mmlu_dataset(chosen_subject)

print(f"Number of test questions: {len(test_dataset)}")

# Step 4: Prompt Construction

Each MMLU question is multiple-choice, and the model must reply with the letter of the correct answer. The prompt consists of the question followed by its four options, labeled A through D, one per line.

We instruct the model as follows: "Please respond with the letter corresponding to your answer (e.g., A, B, C, or D)"

Furthermore, in Step 2 we set the `max_tokens` parameter to 1 so that the model returns only the answer letter, with no additional text.
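
The prompt layout described above can be sketched as a small standalone function. This mirrors the string building done inside the evaluation loop in the next cell. `format_mmlu_prompt` is a hypothetical helper name, and the example question is made up rather than taken from the dataset.

```python
def format_mmlu_prompt(question, choices):
    """Build the user prompt: the question followed by lettered options, one per line."""
    lines = [question]
    for index, choice in enumerate(choices):
        letter = chr(ord("A") + index)  # 0 -> A, 1 -> B, 2 -> C, 3 -> D
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines) + "\n"

# Made-up example question, not taken from the MMLU dataset
example = format_mmlu_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
print(example)
```

The same index-to-letter mapping, `chr(ord("A") + index)`, converts the dataset's integer `answer` field into the letter that the model's reply is compared against.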

In [None]:
# Step 4: Execute the test set on the LLM, compare results, and print accuracy

print(f"\n Running evaluations on {chosen_subject} dataset")

def plot_question_subset(ax, df, start_index, end_index, chosen_subject):
    subset = df.iloc[start_index:end_index]

    colors = subset['Is Correct'].map({'Correct': 'green', 'Incorrect': 'red'})

    ax.bar(subset['Question Number'], [1] * len(subset), color=colors)
    ax.set_xticks(subset['Question Number'])
    ax.set_title(f'Questions {start_index+1} to {end_index} Correct vs Incorrect Answers - {chosen_subject}')
    ax.set_xlabel('Question Number')

    ax.set_ylabel('')
    ax.set_yticks([])
    # Pad the limits by half a unit so the first and last bars are fully visible
    ax.set_xlim(start_index + 0.5, end_index + 0.5)
    ax.set_ylim(0, 1.5)

    # Add legend
    ax.legend(handles=[
        plt.Line2D([0], [0], color='green', lw=4, label='Correct'),
        plt.Line2D([0], [0], color='red', lw=4, label='Incorrect')
    ], loc='upper right', title="Model performance on " + chosen_subject)


def evaluate_mmlu(dataset):
    results = []
    correct = 0
    total = len(dataset)

    for i, item in enumerate(tqdm(dataset, total=total, desc="Evaluating")):
        question = item['question']
        choices = item['choices']
        correct_answer = item['answer']

        prompt = f"{question}\n"
        for j, choice in enumerate(choices):
            prompt += f"{chr(65 + j)}. {choice}\n"

        system_content = "Please respond with the letter corresponding to your answer (e.g., A, B, C, or D)"
        model_answer = GetModelResponse(system_content, prompt)

        is_correct = model_answer.strip().upper() == chr(65 + correct_answer)
        if is_correct:
            correct += 1

        # saving results for CSV
        results.append({
            'Question Number': i + 1,
            'Question': question,
            'Choice A': choices[0],
            'Choice B': choices[1],
            'Choice C': choices[2],
            'Choice D': choices[3],
            'Model Answer': model_answer,
            'Correct Answer': chr(65 + correct_answer),
            'Is Correct': 'Correct' if is_correct else 'Incorrect'
        })


        print(f"\nQuestion {i + 1}:")
        print(question)
        for j, choice in enumerate(choices):
            print(f"{chr(65 + j)}. {choice}")
        print(f"Model's answer: {model_answer}")
        print(f"Correct answer: {chr(65 + correct_answer)}")
        print(f"Result: {'Correct' if is_correct else 'Incorrect'}")
        print("-" * 50)

    accuracy = correct / total


    # Print accuracy
    print(f"\nAccuracy: {accuracy:.2%}")

    # logic to visualize results as subplots
    df = pd.DataFrame(results)
    questions_per_page = 10

    # Ceiling division: one subplot per group of `questions_per_page` questions
    num_plots = (len(df) + questions_per_page - 1) // questions_per_page

    # Arrange the subplots in two columns; squeeze=False keeps `axes` 2-D even for a single row
    num_rows = (num_plots + 1) // 2
    fig, axes = plt.subplots(num_rows, 2, figsize=(15, 10), squeeze=False)
    axes = axes.flatten()

    for plot_num, start in enumerate(range(0, len(df), questions_per_page)):
        end = min(start + questions_per_page, len(df))
        plot_question_subset(axes[plot_num], df, start, end, chosen_subject)

    # Remove only the unused axes, starting after the last subplot that was drawn
    for unused in range(num_plots, len(axes)):
        fig.delaxes(axes[unused])

    plt.tight_layout()
    plt.show()

    return accuracy

# Evaluate on the chosen test set
test_accuracy = evaluate_mmlu(test_dataset)

print(f"\nMMLU Accuracy on {chosen_subject}: {test_accuracy:.2%}")
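
The reported accuracy is the fraction of questions answered correctly. The sketch below recomputes it from a list of per-question records shaped like the `results` entries built inside `evaluate_mmlu`. `accuracy_from_results` is a hypothetical helper, and the sample records are made up for illustration.

```python
def accuracy_from_results(results):
    """Fraction of records whose 'Is Correct' field equals 'Correct' (0.0 for an empty list)."""
    if not results:
        return 0.0
    correct = sum(1 for row in results if row["Is Correct"] == "Correct")
    return correct / len(results)

# Made-up records for illustration only
sample = [
    {"Question Number": 1, "Is Correct": "Correct"},
    {"Question Number": 2, "Is Correct": "Incorrect"},
    {"Question Number": 3, "Is Correct": "Correct"},
    {"Question Number": 4, "Is Correct": "Correct"},
]
print(f"Accuracy: {accuracy_from_results(sample):.2%}")  # Accuracy: 75.00%
```

Guarding against an empty list avoids a division by zero if a subject were to load with no questions.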