<a href="https://colab.research.google.com/github/xiyuanliang666/AI6130-LLM/blob/main/LLM_Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open-LLM-Benchmark Assignment

This code block installs the datasets library from Hugging Face, a Python package for loading, processing, and evaluating benchmark datasets commonly used in natural language processing (NLP) and machine learning.

In this assignment, the datasets library is used to load and preprocess data from the Open-LLM-Benchmark, which provides a standardized framework for evaluating large language models on various NLP tasks. Running this cell makes the library available in your Colab environment.

In [None]:
!pip install datasets



This code block initiates the login process for Hugging Face using the huggingface-cli login command. Hugging Face is a platform offering tools and resources for machine learning, including pre-trained models, datasets, and APIs.

When you run this command, you are prompted to enter your Hugging Face token, which you can generate at https://huggingface.co/settings/tokens. Logging in allows you to access private or gated models, datasets, and APIs that require authentication. It also simplifies working with the Hugging Face ecosystem for tasks like model inference, dataset access, or fine-tuning.

In the context of this assignment, logging in ensures that you can fetch resources from the Hugging Face hub required for evaluating language models on the Open-LLM-Benchmark.
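
For non-interactive runs, the same login can be done from Python. The sketch below is one way to do it, assuming the token is stored in an environment variable named `HF_TOKEN`. That variable name and the helper `login_from_env` are choices made for this example, not something the assignment prescribes; `huggingface_hub.login` is the Python counterpart of the CLI command.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in to the Hugging Face Hub with a token read from the environment.

    Returns True if a token was found and passed to `login`, False otherwise.
    `var_name` and this helper are illustrative, not part of the assignment.
    """
    token = os.environ.get(var_name)
    if not token:
        print(f"{var_name} is not set; skipping login.")
        return False
    # Imported lazily so the helper can be defined without the package installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` after setting the variable avoids pasting the token into a prompt, and the helper simply reports and returns `False` when no token is available.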

In [None]:
# !huggingface-cli login
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `llm` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `llm`


*   Importing Necessary Libraries:

    *   datasets: A library from Hugging Face used for accessing and processing datasets.
    *   json: A Python library for handling JSON data structures.

*   Loading the Evaluation Dataset:

    *   The datasets.load_dataset function loads the Open-LLM-Benchmark dataset using its "questions" configuration; the examples are exposed under the train split. This dataset contains evaluation questions designed to test the performance of large language models (LLMs) across various tasks.

*   Grouping the Dataset by Source:

    *   The dataset examples are iterated through, and the dataset field in each example is used to group the questions based on their originating dataset.
    *   A dictionary named grouped_datasets is created, where:
        *   Keys represent the source datasets (e.g., task names or domains).
        *   Values are lists of examples belonging to that dataset.

In [None]:
import datasets
import json

eval_set = datasets.load_dataset("Open-Style/Open-LLM-Benchmark", "questions")
grouped_datasets = {}
for example in eval_set['train']:
    dataset = example["dataset"]
    if dataset not in grouped_datasets:
        grouped_datasets[dataset] = []
    grouped_datasets[dataset].append(example)


In [None]:
grouped_datasets.keys()

dict_keys(['ARC', 'CommonsenseQA', 'Hellaswag', 'MMLU', 'MedMCQA', 'OpenbookQA', 'Winogrande', 'piqa', 'race'])
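
Before writing a prompt, it helps to check how many questions each group holds and which option labels it uses, since the benchmarks do not all have the same number of choices. The helper below is a small sketch: `summarize_groups` is a name introduced here, and it assumes each example has an `options` list whose items carry a `label` key, as the inference code in this notebook does. The toy data only mimics that shape.

```python
def summarize_groups(grouped):
    """Return {dataset_name: (num_examples, sorted option labels)}."""
    summary = {}
    for name, examples in grouped.items():
        labels = set()
        for example in examples:
            for option in example["options"]:
                labels.add(option["label"])
        summary[name] = (len(examples), sorted(labels))
    return summary


# Toy data in the same shape as the benchmark examples (not real benchmark rows).
toy_groups = {
    "two_choice": [
        {"options": [{"label": "A", "text": "x"}, {"label": "B", "text": "y"}]}
    ],
    "five_choice": [
        {"options": [{"label": letter, "text": letter.lower()} for letter in "ABCDE"]}
    ],
}
for name, (count, labels) in summarize_groups(toy_groups).items():
    print(f"{name}: {count} examples, labels {labels}")
```

On the real data, `summarize_groups(grouped_datasets)` would show which groups need a prompt that lists more or fewer than four letters.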

*   Importing Required Classes:

    *   AutoTokenizer: A generic class for loading the appropriate tokenizer for a given pre-trained model.
    *   AutoModelForCausalLM: A generic class for loading a causal language model, designed for autoregressive tasks like text generation.
*   Loading the Pre-trained Model:

    *   The AutoModelForCausalLM.from_pretrained function loads the model named by its identifier (Qwen/Qwen3-4B-Instruct-2507 in the cell below) from the Hugging Face Hub.
    *   device_map="auto" ensures that the model is automatically loaded onto the available hardware (e.g., GPU or CPU) for optimal performance.
*   Loading the Tokenizer:

    *   The AutoTokenizer.from_pretrained function loads the tokenizer corresponding to the same pre-trained model.
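    *   padding_side="left" can be passed so that padding tokens are added on the left, which is what causal language models need when several prompts are batched for generation. The cells below tokenize one prompt at a time and therefore leave it unset.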

Note: The cell below loads Qwen/Qwen3-4B-Instruct-2507, the first model evaluated in this notebook; the later sections repeat the evaluation with deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B and TinyLlama/TinyLlama_v1.1. Replace the identifier with the model and tokenizer you want to use. For example:

*   If you are testing another model, update the model identifier (e.g., "EleutherAI/gpt-j-6B").

*   Ensure the model and tokenizer are compatible with the dataset and task at hand.

You can find a suitable model in this list: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
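
A loading helper makes it easier to swap models without editing several lines. This is a minimal sketch, assuming `transformers` is installed (and `accelerate`, which `device_map="auto"` relies on). `load_model_and_tokenizer` is a name introduced here, and falling back to the EOS token when the tokenizer has no padding token is a common workaround rather than something this assignment requires.

```python
def load_model_and_tokenizer(model_id, padding_side="left"):
    """Load a causal LM and its tokenizer from the Hugging Face Hub.

    `model_id` is a Hub identifier such as "TinyLlama/TinyLlama_v1.1".
    """
    # Imported lazily so this helper can be defined before transformers is installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side=padding_side)
    if tokenizer.pad_token is None:
        # Assumption: reuse EOS as the padding token when none is defined.
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    model.eval()  # inference only: disable dropout
    return model, tokenizer
```

A call such as `model, tokenizer = load_model_and_tokenizer("TinyLlama/TinyLlama_v1.1")` would then replace the two `from_pretrained` lines.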

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", device_map="auto")

### Replace the model and tokenizer identifiers above with the model you want to evaluate




1.   Device Allocation:

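*   The device variable stores the device on which device_map="auto" placed the model, so that the tokenized inputs can be moved to the same device.
*   model = model.to(device) is left commented out because the model has already been placed; uncomment it only if you load the model without device_map.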

2.   Inference Function (infer_llm):

*   Takes a single dataset sample as input.
*   Constructs a prompt that includes the question, options, and an "Answer:" section where the model generates its response.
*   Customization:

    *   Modify the prompt to match the style of your model (e.g., "Answer in yes/no," or "Answer in a single word").
*   Tokenizes the prompt using the tokenizer and prepares it for the model by converting it to the appropriate device.


*   Generates output using the model:
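    *   max_new_tokens: Caps the number of generated tokens (2 in the cells below), which keeps inference fast.
    *   do_sample=True: Allows for more varied outputs by introducing randomness, so repeated runs can give different accuracies; do_sample=False uses deterministic greedy decoding.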
*   Decodes the model's output into a human-readable string and extracts the generated answer.
3.   Evaluation Function (evaluate_samples):

*   Iterates through all samples in a dataset.
*   Calls the infer_llm function to predict answers for each sample.
*   Compares the predicted answers to the correct answers (sample["answerKey"]).
*   Tracks the number of correct predictions and calculates the accuracy.
4.   Dataset Evaluation:

*   The evaluate_samples function is applied to the CommonsenseQA dataset from the grouped datasets.
*   Returns the predicted answers and the model's accuracy on the dataset.

5.   Customization:
*   Prompt Design: Modify the prompt string in infer_llm to align with the expected input format of your chosen LLM.
*   Model Parameters: Adjust settings like max_new_tokens and do_sample to balance output quality and computational efficiency.
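
The prompt construction and answer extraction described above can be separated from the model call, which makes them easy to check on their own. The sketch below assumes the sample layout used in this notebook (`question`, plus `options` with `label` and `text`). `build_prompt` and `extract_answer` are names introduced here, and returning `None` for an unparseable reply is a design choice of this example.

```python
import re


def build_prompt(sample):
    """Format a question and its options, listing the labels that actually occur."""
    labels = [option["label"] for option in sample["options"]]
    lines = [sample["question"]]
    lines += [f"{option['label']}: {option['text']}" for option in sample["options"]]
    lines.append(f"Answer with only one letter ({', '.join(labels)}).")
    lines.append("Answer: ")
    return "\n".join(lines)


def extract_answer(generated_text, valid_labels):
    """Return the first standalone letter after the last 'Answer:' that is a valid label."""
    tail = generated_text.split("Answer:")[-1]
    for match in re.finditer(r"\b[A-Za-z]\b", tail):
        letter = match.group(0).upper()
        if letter in valid_labels:
            return letter
    return None  # nothing usable in the reply


toy_sample = {
    "question": "Which animal barks?",
    "options": [{"label": "A", "text": "Cat"}, {"label": "B", "text": "Dog"}],
}
toy_prompt = build_prompt(toy_sample)
print(toy_prompt)
print(extract_answer(toy_prompt + "B", ["A", "B"]))
```

Deriving the letter list from the options keeps the instruction consistent with datasets that have two, four, or five choices, and an explicit `None` makes unparseable replies easy to count separately from wrong answers.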


# Qwen3-4B-Instruct-2507

In [None]:
import time
import tqdm

# Reuse the device on which device_map="auto" placed the model
device = model.device
# model = model.to(device)

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "You are solving multiple-choice questions. Answer (ONLY one letter: A, B, C or D). Do NOT add any explanation.\n"
    prompt += """Here are some few shot examples:
Example 1:
Question: The sky is blue because of...
A: Reflection
B: Refraction
C: Rayleigh scattering
D: Absorption
Answer: C

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
C: Cow
D: Sheep
Answer: B
"""
    prompt += "Answer: "

    # Tokenize the input(general)
    # inputs = tokenizer(prompt, return_tensors="pt").to(device)

    #qwen3
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)


    # Generate the output
    # outputs
    generated_ids = model.generate(
        # qwen3
        **inputs,
        # original
        # inputs["input_ids"],
        max_new_tokens=2,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        # attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )
    # qwen3 exclude inputs
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()

    # Decode the output
    # decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # qwen3 - modify
    decoded_output = tokenizer.decode(output_ids, skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['CommonsenseQA'])

100%|██████████| 710/710 [19:54<00:00,  1.68s/it]

Time: 1194.4756064414978
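
Because only the dataset name changes between runs, the evaluation loop can take the inference function as an argument and be reused across datasets and models. This is a sketch under that assumption: `evaluate_with` and the stub predictor are names made up for this example, and the stub stands in for `infer_llm` so the loop can be checked without a model.

```python
import time


def evaluate_with(infer_fn, samples):
    """Run infer_fn on every sample; return (predictions, accuracy in percent, seconds)."""
    start = time.perf_counter()
    predictions = [infer_fn(sample) for sample in samples]
    elapsed = time.perf_counter() - start
    correct = sum(
        prediction == sample["answerKey"]
        for prediction, sample in zip(predictions, samples)
    )
    # Guard against an empty dataset instead of dividing by zero.
    accuracy = 100.0 * correct / len(samples) if samples else 0.0
    return predictions, accuracy, elapsed


# Stub predictor that always answers "A" (a stand-in for infer_llm).
toy_samples = [{"answerKey": "A"}, {"answerKey": "B"}, {"answerKey": "A"}, {"answerKey": "C"}]
toy_predictions, toy_accuracy, _ = evaluate_with(lambda sample: "A", toy_samples)
print(f"Accuracy: {toy_accuracy:.2f}%")  # prints "Accuracy: 50.00%"
```

With the notebook's objects, a call such as `evaluate_with(infer_llm, grouped_datasets['OpenbookQA'])` would avoid re-defining both functions for every dataset.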





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['CommonsenseQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Predicted Answer: A
Correct Answer: A

Question: What do people aim to do at work?
Predicted Answer: A
Correct Answer: A

Question: Where would you find magazines along side many other printed works?
Predicted Answer: B
Correct Answer: B

Question: Where are  you likely to find a hamburger?
Predicted Answer: A
Correct Answer: A

Question: James was looking for a good place to buy farmland.  Where might he look?
Predicted Answer: D
Correct Answer: A

Question: In what Spanish speaking North American country can you get a great cup of coffee?
Predicted Answer: B
Correct Answer: B

Question: What do animals do when an enemy is approaching?
Predicted Answer: D
Correct Answer: D

Question: Reading newspaper one of many ways to practice your what?
Predicted Answer: A
Correct Answer: A

Question: What do people typically do while playing guitar?
Predicted Answer: E
Correct An

In [None]:
import time
import tqdm

# Reuse the device on which device_map="auto" placed the model
device = model.device
# model = model.to(device)

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "You are solving multiple-choice questions. Answer (ONLY one letter: A, B, C or D). Do NOT add any explanation.\n"
    prompt += """Here are some few shot examples:
Example 1:
Question: The sky is blue because of...
A: Reflection
B: Refraction
C: Rayleigh scattering
D: Absorption
Answer: C

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
C: Cow
D: Sheep
Answer: B
"""
    prompt += "Answer: "

    # Tokenize the input(general)
    # inputs = tokenizer(prompt, return_tensors="pt").to(device)

    #qwen3
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)


    # Generate the output
    # outputs
    generated_ids = model.generate(
        # qwen3
        **inputs,
        # original
        # inputs["input_ids"],
        max_new_tokens=2,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        # attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )
    # qwen3 exclude inputs
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()

    # Decode the output
    # decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # qwen3 - modify
    decoded_output = tokenizer.decode(output_ids, skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['OpenbookQA'])

100%|██████████| 491/491 [13:38<00:00,  1.67s/it]

Time: 818.9106993675232





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['OpenbookQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: There is most likely going to be fog around:
Predicted Answer: A
Correct Answer: A

Question: The middle of the day usually involves the bright star nearest to the earth to be straight overhead why?
Predicted Answer: B
Correct Answer: B

Question: The main component in dirt is
Predicted Answer: B
Correct Answer: B

Question: A cactus stem is used to store
Predicted Answer: B
Correct Answer: B

Question: A red-tailed hawk is searching for prey. It is most likely to swoop down on
Predicted Answer: D
Correct Answer: C

Question: The chance of wildfires is increased by
Predicted Answer: A
Correct Answer: A

Question: A positive effect of burning biofuel is
Predicted Answer: C
Correct Answer: C

Question: A Mola Mola might live where?
Predicted Answer: C
Correct Answer: C

Question: An animal that only eats plants is a
Predicted Answer: B
Correct Answer: B

Question: What can feathers on Spheniscidae be used for?
Predicted Answer: A
Correct Answer: A

Question: If you were attacke

In [None]:
import time
import tqdm

# Reuse the device on which device_map="auto" placed the model
device = model.device
# model = model.to(device)

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "You are solving multiple-choice questions. Answer (ONLY one letter: A, B, C or D). Do NOT add any explanation.\n"
    prompt += """Here are some few shot examples:
Example 1:
Question: The sky is blue because of...
A: Reflection
B: Refraction
C: Rayleigh scattering
D: Absorption
Answer: C

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
C: Cow
D: Sheep
Answer: B
"""
    prompt += "Answer: "

    # Tokenize the input(general)
    # inputs = tokenizer(prompt, return_tensors="pt").to(device)

    #qwen3
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)


    # Generate the output
    # outputs
    generated_ids = model.generate(
        # qwen3
        **inputs,
        # original
        # inputs["input_ids"],
        max_new_tokens=2,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        # attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )
    # qwen3 exclude inputs
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()

    # Decode the output
    # decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # qwen3 - modify
    decoded_output = tokenizer.decode(output_ids, skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['piqa'])

100%|██████████| 696/696 [19:50<00:00,  1.71s/it]

Time: 1190.1817667484283





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['piqa']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

# DeepSeek-R1-Distill-Qwen-1.5B

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device_map="auto")

### Replace the model and tokenizer identifiers above with the model you want to evaluate


In [None]:
import time
import tqdm
import re

# Reuse the device on which device_map="auto" placed the model
device = model.device
# model = model.to(device)

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = """You are solving multiple-choice questions.\n
Choose the correct option and output only one letter (A, B, C, D or E). \n
Do NOT add explanations. No words, no numbers, no punctuation.\n
Answer the question by applying common sense knowledge.\n

Example 1:
Question: The sky is blue because of...\n
A: Reflection\n
B: Refraction\n
C: Rayleigh scattering\n
D: Absorption\n
E: Emission\n
Answer: C\n

Example 2:\n
Question: Which animal barks?\n
A: Cat\n
B: Dog\n
C: Cow\n
D: Sheep\n
E: Pig\n
Answer: B\n

Example 3:\n
Question: Which planet is known as the Red Planet?\n
A: Venus\n
B: Jupiter\n
C: Saturn\n
D: Mars\n
E: None of the above\n
Answer: D\n

Example 4:
Question: Which is the largest mammal?\n
A: Dog\n
B: Cat\n
C: Elephant\n
D: Giraffe\n
E: Blue whale\n
Answer: E\n
Now answer this question:\n"""
    prompt += f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "Answer: "

    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate the output
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=2,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=False,
    )

    # Decode the output
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    # match = re.search(r"Answer:\s*([A-E])\b", decoded_output, re.IGNORECASE)
    # if match:
    #     answer = match.group(1).upper()
    # else:
    #     answer = "X"
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['CommonsenseQA'])

100%|██████████| 710/710 [03:24<00:00,  3.47it/s]

Time: 204.85829257965088





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['CommonsenseQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Predicted Answer: C
Correct Answer: A

Question: What do people aim to do at work?
Predicted Answer: C
Correct Answer: A

Question: Where would you find magazines along side many other printed works?
Predicted Answer: B
Correct Answer: B

Question: Where are  you likely to find a hamburger?
Predicted Answer: A
Correct Answer: A

Question: James was looking for a good place to buy farmland.  Where might he look?
Predicted Answer: C
Correct Answer: A

Question: In what Spanish speaking North American country can you get a great cup of coffee?
Predicted Answer: C
Correct Answer: B

Question: What do animals do when an enemy is approaching?
Predicted Answer: C
Correct Answer: D

Question: Reading newspaper one of many ways to practice your what?
Predicted Answer: C
Correct Answer: A

Question: What do people typically do while playing guitar?
Predicted Answer: E
Correct An

In [None]:
import time
import tqdm
import re

# Reuse the device on which device_map="auto" placed the model
device = model.device
# model = model.to(device)

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = """You are solving multiple-choice questions.\n
Choose the correct option and output only one letter (A, B, C or D). \n
Do NOT add explanations. No words, no numbers, no punctuation.\n
Answer the question by applying common sense knowledge.\n

Example 1:
Question: The sky is blue because of...\n
A: Reflection\n
B: Refraction\n
C: Rayleigh scattering\n
D: Absorption\n
Answer: C\n

Example 2:\n
Question: Which animal barks?\n
A: Cat\n
B: Dog\n
C: Cow\n
D: Sheep\n
Answer: B\n

Example 3:\n
Question: Which planet is known as the Red Planet?\n
A: Venus\n
B: Jupiter\n
C: Saturn\n
D: Mars\n
Answer: D\n

Now answer this question:\n"""
    prompt += f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "Answer: "

    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate the output
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=2,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=False,
    )

    # Decode the output
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    # match = re.search(r"Answer:\s*([A-E])\b", decoded_output, re.IGNORECASE)
    # if match:
    #     answer = match.group(1).upper()
    # else:
    #     answer = "X"
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['OpenbookQA'])

100%|██████████| 491/491 [02:00<00:00,  4.09it/s]

Time: 120.00753569602966





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['OpenbookQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: There is most likely going to be fog around:
Predicted Answer: B
Correct Answer: A

Question: The middle of the day usually involves the bright star nearest to the earth to be straight overhead why?
Predicted Answer: B
Correct Answer: B

Question: The main component in dirt is
Predicted Answer: B
Correct Answer: B

Question: A cactus stem is used to store
Predicted Answer: D
Correct Answer: B

Question: A red-tailed hawk is searching for prey. It is most likely to swoop down on
Predicted Answer: B
Correct Answer: C

Question: The chance of wildfires is increased by
Predicted Answer: C
Correct Answer: A

Question: A positive effect of burning biofuel is
Predicted Answer: B
Correct Answer: C

Question: A Mola Mola might live where?
Predicted Answer: C
Correct Answer: C

Question: An animal that only eats plants is a
Predicted Answer: B
Correct Answer: B

Question: What can feathers on Spheniscidae be used for?
Predicted Answer: C
Correct Answer: A

Question: If you were attacke

In [None]:
import time
import tqdm
import re

# Reuse the device on which device_map="auto" placed the model
device = model.device
# model = model.to(device)

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = """You are solving multiple-choice questions.\n
Choose the correct option and output only one letter (A or B). \n
Do NOT add explanations. No words, no numbers, no punctuation.\n
Answer the question by applying common sense knowledge.\n

Example 1:
Question: The sky is blue because of...\n
A: Reflection\n
B: Rayleigh scattering\n
Answer: B\n

Example 2:\n
Question: Which animal barks?\n
A: Cat\n
B: Dog\n
Answer: B\n

Now answer this question:\n"""
    prompt += f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "Answer: "

    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate the output
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=2,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=False,
    )

    # Decode the output
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    # match = re.search(r"Answer:\s*([A-E])\b", decoded_output, re.IGNORECASE)
    # if match:
    #     answer = match.group(1).upper()
    # else:
    #     answer = "X"
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['piqa'])

100%|██████████| 696/696 [02:33<00:00,  4.54it/s]

Time: 153.31937456130981





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['piqa']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: How do I ready a guinea pig cage for it's new occupants?
Predicted Answer: B
Correct Answer: A

Question: Remove soap scum from shower door.
Predicted Answer: B
Correct Answer: B

Question: Neatly wrap up an extension cord.
Predicted Answer: B
Correct Answer: A

Question: how do you put eyelashes on?
Predicted Answer: B
Correct Answer: B

Question: How to clean blinds without tearing them up
Predicted Answer: B
Correct Answer: A

Question: What material is a steel rocking chair made out of?
Predicted Answer: B
Correct Answer: A

Question: To cream butter and sugar together, you can
Predicted Answer: B
Correct Answer: B

Question: How to best cut the meat to place on a grill?
Predicted Answer: B
Correct Answer: B

Question: What alcohol do you pour for a mojito?
Predicted Answer: B
Correct Answer: A

Question: How to start an automatic transmission car.
Predicted Answer: B
Correct Answer: A

Question: How to make sure all the clocks in the house are set accurately?
Predicted A
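
Every prediction printed above is `B`, which suggests the model may be favoring one label regardless of the question. A minimal sketch for checking this, assuming `predictions` is the list of strings returned by `evaluate_samples`. The helper name `label_distribution` and the example values are hypothetical and not taken from the run:

```python
from collections import Counter

def label_distribution(predictions):
    """Return the share of each predicted label as a fraction of all predictions."""
    total = len(predictions)
    if total == 0:
        return {}
    counts = Counter(predictions)
    return {label: count / total for label, count in counts.items()}

# Hypothetical predictions, for illustration only
example_predictions = ["B", "B", "A", "B"]
print(label_distribution(example_predictions))
```

A distribution dominated by one label, on a dataset whose answer keys are roughly balanced, would point to a prompt or decoding problem rather than a knowledge gap.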

# TinyLlama_v1.1

In [None]:
from transformers import AutoTokenizer,AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama_v1.1")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama_v1.1", device_map="auto")

### Change the model and tokenizer here to match the model you want to evaluate

In [None]:
import time
import tqdm
import re

# Reuse the device the model was placed on at load time (device_map="auto" picks a GPU if available)
device = model.device

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = """You are solving multiple-choice questions.
Choose the correct option and output only one letter (A or B).
Do NOT add explanations. No words, no numbers, no punctuation.

Example 1:
Question: The sky is blue because of...
A: Reflection
B: Rayleigh scattering
Answer: B

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
Answer: B

Now answer this question:
"""
    prompt += f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "Answer: "

    # Tokenize the input(general)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate the output
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=6,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )

    # Decode the output
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    # Keep only the first standalone option letter, if the model produced one
    match = re.search(r"\b[A-D]\b", answer)
    if match:
        answer = match.group(0)
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['piqa'])

100%|██████████| 696/696 [03:31<00:00,  3.29it/s]

Time: 211.50935125350952





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['piqa']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: How do I ready a guinea pig cage for it's new occupants?
Predicted Answer: A
Correct Answer: A

Question: Remove soap scum from shower door.
Predicted Answer: C
Correct Answer: B

Question: Neatly wrap up an extension cord.
Predicted Answer: C
Correct Answer: A

Question: how do you put eyelashes on?
Predicted Answer: B
Correct Answer: B

Question: How to clean blinds without tearing them up
Predicted Answer: A
Correct Answer: A

Question: What material is a steel rocking chair made out of?
Predicted Answer: A
Correct Answer: A

Question: To cream butter and sugar together, you can
Predicted Answer: A
Correct Answer: B

Question: How to best cut the meat to place on a grill?
Predicted Answer: C
Correct Answer: B

Question: What alcohol do you pour for a mojito?
Predicted Answer: B
Correct Answer: A

Question: How to start an automatic transmission car.
Predicted Answer: 2

Now answer this
Correct Answer: A

Question: How to make sure all the clocks in the house are set accura
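
Some generations above are not a bare letter, for example `2` followed by `Now answer this`. A minimal sketch of normalizing raw output against the labels a dataset actually uses, returning `None` when no valid letter is present so that invalid outputs can be counted separately. The function name `extract_choice` and its `valid_labels` parameter are hypothetical:

```python
import re

def extract_choice(raw_text, valid_labels=("A", "B")):
    """Return the first standalone valid label in raw_text, or None if absent."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in valid_labels) + r")\b"
    match = re.search(pattern, raw_text)
    return match.group(1) if match else None

print(extract_choice("B"))
print(extract_choice("2\n\nNow answer this"))
```

One caveat of this approach: a capitalized article such as the `A` in `A door` would also match, so it works best on very short generations.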

In [None]:
import time
import tqdm
import re

# Reuse the device the model was placed on at load time (device_map="auto" picks a GPU if available)
device = model.device

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = """You are solving multiple-choice questions.
Choose the correct option and output only one letter (A, B, C, or D).
Do NOT add explanations. No words, no numbers, no punctuation.

Example 1:
Question: The sky is blue because of...
A: Reflection
B: Refraction
C: Rayleigh scattering
D: Absorption
Answer: C

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
C: Cow
D: Sheep
Answer: B

Now answer this question:
"""
    prompt += f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "Answer: "

    # Tokenize the input(general)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate the output
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=6,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )

    # Decode the output
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    # Keep only the first standalone option letter, if the model produced one
    match = re.search(r"\b[A-D]\b", answer)
    if match:
        answer = match.group(0)
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['OpenbookQA'])

100%|██████████| 491/491 [02:17<00:00,  3.57it/s]

Time: 137.65795993804932





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['OpenbookQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: There is most likely going to be fog around:
Predicted Answer: D
Correct Answer: A

Question: The middle of the day usually involves the bright star nearest to the earth to be straight overhead why?
Predicted Answer: A
Correct Answer: B

Question: The main component in dirt is
Predicted Answer: C
Correct Answer: B

Question: A cactus stem is used to store
Predicted Answer: A
Correct Answer: B

Question: A red-tailed hawk is searching for prey. It is most likely to swoop down on
Predicted Answer: B
Correct Answer: C

Question: The chance of wildfires is increased by
Predicted Answer: Also answer this question
Correct Answer: A

Question: A positive effect of burning biofuel is
Predicted Answer: A
Correct Answer: C

Question: A Mola Mola might live where?
Predicted Answer: A
Correct Answer: C

Question: An animal that only eats plants is a
Predicted Answer: A
Correct Answer: B

Question: What can feathers on Spheniscidae be used for?
Predicted Answer: B
Correct Answer: A

Quest

In [None]:
import time
import tqdm
import re

# Reuse the device the model was placed on at load time (device_map="auto" picks a GPU if available)
device = model.device

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = """You are solving multiple-choice questions.
Choose the correct option and output only one letter (A, B, C, D, or E).
Do NOT add explanations. No words, no numbers, no punctuation.

Example 1:
Question: The sky is blue because of...
A: Reflection
B: Refraction
C: Rayleigh scattering
D: Absorption
Answer: C

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
C: Cow
D: Sheep
Answer: B

Now answer this question:
"""
    prompt += f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "Answer: "

    # Tokenize the input(general)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate the output
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=6,
        # max_length=len(inputs["input_ids"][0]) + 5, ### You may refer to the max_new_tokens parameter to speed up inference.
        num_return_sequences=1,
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )

    # Decode the output
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    # CommonsenseQA has five options, so accept A through E
    match = re.search(r"\b[A-E]\b", answer)
    if match:
        answer = match.group(0)
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['CommonsenseQA'])

100%|██████████| 710/710 [03:20<00:00,  3.54it/s]

Time: 200.66925644874573





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['CommonsenseQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Predicted Answer: D
Correct Answer: A

Question: What do people aim to do at work?
Predicted Answer: C
Correct Answer: A

Question: Where would you find magazines along side many other printed works?
Predicted Answer: A
Correct Answer: B

Question: Where are  you likely to find a hamburger?
Predicted Answer: E

Now answer the
Correct Answer: A

Question: James was looking for a good place to buy farmland.  Where might he look?
Predicted Answer: A
Correct Answer: A

Question: In what Spanish speaking North American country can you get a great cup of coffee?
Predicted Answer: A
Correct Answer: B

Question: What do animals do when an enemy is approaching?
Predicted Answer: A
Correct Answer: D

Question: Reading newspaper one of many ways to practice your what?
Predicted Answer: 5a
Correct Answer: A

Question: What do people typically do while playing guitar?
Predicted Ans
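
The prompt in the cell above hard-codes five letters, while other benchmarks in this notebook use two or four options. A minimal sketch of deriving the allowed letters from each sample instead, assuming every sample carries the `question` and `options` fields used in `infer_llm`. The function name `build_prompt` and the toy sample are hypothetical:

```python
def build_prompt(sample):
    """Build a multiple-choice prompt whose instruction matches the sample's own labels."""
    labels = [option["label"] for option in sample["options"]]
    if not labels:
        raise ValueError("sample has no options")
    if len(labels) > 1:
        allowed = ", ".join(labels[:-1]) + " or " + labels[-1]
    else:
        allowed = labels[0]
    lines = [
        "You are solving multiple-choice questions.",
        f"Output only one letter ({allowed}).",
        f"Question: {sample['question']}",
    ]
    lines += [f"{option['label']}: {option['text']}" for option in sample["options"]]
    lines.append("Answer:")
    return "\n".join(lines)

# Toy sample for illustration only; not drawn from the benchmark
toy_sample = {
    "question": "Which animal barks?",
    "options": [{"label": "A", "text": "Cat"}, {"label": "B", "text": "Dog"}],
}
print(build_prompt(toy_sample))
```

With a builder like this, the same cell could serve all three datasets without editing the instruction text by hand.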

# Qwen2.5-3B-Instruct

In [None]:
from transformers import AutoTokenizer,AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", torch_dtype="auto", device_map="auto")

### Change the model and tokenizer here to match the model you want to evaluate

`torch_dtype` is deprecated! Use `dtype` instead!


In [None]:
import time
import tqdm

# Reuse the device the model was placed on at load time (device_map="auto" picks a GPU if available)
device = model.device

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "You are solving multiple-choice questions. Answer (ONLY one letter: A, B, C, D or E). Do NOT add any explanation.\n"
    prompt += """Here are some few shot examples:
Example 1:
Question: The sky is blue because of...
A: Reflection
B: Refraction
C: Rayleigh scattering
D: Absorption
E: Emission
Answer: C

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
C: Cow
D: Sheep
E: Pig
Answer: B
"""
    prompt += "Answer: "

    # Wrap the prompt in the Qwen chat template before tokenizing
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Generate the output; **inputs passes both input_ids and attention_mask
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=2,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )
    # Drop the prompt tokens so only the newly generated tokens remain
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()

    # Decode the generated tokens
    decoded_output = tokenizer.decode(output_ids, skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['CommonsenseQA'])

100%|██████████| 710/710 [06:16<00:00,  1.89it/s]

Time: 376.21291494369507





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['CommonsenseQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Predicted Answer: A
Correct Answer: A

Question: What do people aim to do at work?
Predicted Answer: B
Correct Answer: A

Question: Where would you find magazines along side many other printed works?
Predicted Answer: B
Correct Answer: B

Question: Where are  you likely to find a hamburger?
Predicted Answer: A
Correct Answer: A

Question: James was looking for a good place to buy farmland.  Where might he look?
Predicted Answer: D
Correct Answer: A

Question: In what Spanish speaking North American country can you get a great cup of coffee?
Predicted Answer: B
Correct Answer: B

Question: What do animals do when an enemy is approaching?
Predicted Answer: D
Correct Answer: D

Question: Reading newspaper one of many ways to practice your what?
Predicted Answer: A
Correct Answer: A

Question: What do people typically do while playing guitar?
Predicted Answer: E
Correct An

In [None]:
import time
import tqdm

# Reuse the device the model was placed on at load time (device_map="auto" picks a GPU if available)
device = model.device

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "You are solving multiple-choice questions. Answer (ONLY one letter: A, B, C, D). Do NOT add any explanation.\n"
    prompt += """Here are some few shot examples:
Example 1:
Question: The sky is blue because of...
A: Reflection
B: Refraction
C: Rayleigh scattering
D: Absorption
E: Emission
Answer: C

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
C: Cow
D: Sheep
E: Pig
Answer: B
"""
    prompt += "Answer: "

    # Wrap the prompt in the Qwen chat template before tokenizing
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Generate the output; **inputs passes both input_ids and attention_mask
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=2,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )
    # Drop the prompt tokens so only the newly generated tokens remain
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()

    # Decode the generated tokens
    decoded_output = tokenizer.decode(output_ids, skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['OpenbookQA'])

100%|██████████| 491/491 [04:26<00:00,  1.85it/s]

Time: 266.05510926246643





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['OpenbookQA']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: There is most likely going to be fog around:
Predicted Answer: A
Correct Answer: A

Question: The middle of the day usually involves the bright star nearest to the earth to be straight overhead why?
Predicted Answer: B
Correct Answer: B

Question: The main component in dirt is
Predicted Answer: A
Correct Answer: B

Question: A cactus stem is used to store
Predicted Answer: B
Correct Answer: B

Question: A red-tailed hawk is searching for prey. It is most likely to swoop down on
Predicted Answer: B
Correct Answer: C

Question: The chance of wildfires is increased by
Predicted Answer: A
Correct Answer: A

Question: A positive effect of burning biofuel is
Predicted Answer: C
Correct Answer: C

Question: A Mola Mola might live where?
Predicted Answer: C
Correct Answer: C

Question: An animal that only eats plants is a
Predicted Answer: B
Correct Answer: B

Question: What can feathers on Spheniscidae be used for?
Predicted Answer: A
Correct Answer: A

Question: If you were attacke
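
`evaluate_samples` is redefined unchanged in every cell and calls whichever `infer_llm` is currently in scope. A sketch of one reusable version that takes the prediction function as an argument and guards against an empty dataset. The name `evaluate_with`, the stub predictor, and the toy samples are hypothetical:

```python
import time

def evaluate_with(predict, samples):
    """Run predict on each sample; return predictions, accuracy in percent, and elapsed seconds."""
    samples = list(samples)
    start = time.time()
    predictions = [predict(sample) for sample in samples]
    elapsed = time.time() - start
    correct = sum(
        prediction == sample["answerKey"]
        for prediction, sample in zip(predictions, samples)
    )
    # Avoid a ZeroDivisionError when the dataset is empty
    accuracy = correct / len(samples) * 100 if samples else 0.0
    return predictions, accuracy, elapsed

# Stub predictor and toy samples, for illustration only
toy_samples = [{"answerKey": "A"}, {"answerKey": "B"}]
toy_predictions, toy_accuracy, _ = evaluate_with(lambda sample: "A", toy_samples)
print(toy_predictions, toy_accuracy)
```

Passing the predictor explicitly makes it harder to evaluate a dataset with a stale `infer_llm` left over from an earlier cell.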

In [None]:
import time
import tqdm

# Reuse the device the model was placed on at load time (device_map="auto" picks a GPU if available)
device = model.device

# Define a function for inference
def infer_llm(sample):
    question = sample["question"]
    options = sample["options"]
    prompt = f"{question}\n"
    for option in options:
        prompt += f"{option['label']}: {option['text']}\n"

    ### You can change the prompt to suit the model you are using.
    # Example:
    # Answer in A/B/C/D:
    # Answer in a single word or phrase:

    prompt += "You are solving multiple-choice questions. Answer (ONLY one letter: A or B). Do NOT add any explanation.\n"
    prompt += """Here are some few shot examples:
Example 1:
Question: The sky is blue because of...
A: Reflection
B: Rayleigh scattering
Answer: B

Example 2:
Question: Which animal barks?
A: Cat
B: Dog
Answer: B
"""
    prompt += "Answer: "

    # Wrap the prompt in the Qwen chat template before tokenizing
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Generate the output; **inputs passes both input_ids and attention_mask
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=2,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )
    # Drop the prompt tokens so only the newly generated tokens remain
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()

    # Decode the generated tokens
    decoded_output = tokenizer.decode(output_ids, skip_special_tokens=True)
    answer = decoded_output.split("Answer:")[-1].strip()
    return answer

# Define the evaluation function
def evaluate_samples(samples):
    correct = 0
    total = len(samples)
    predictions = []

    s = time.time()
    for sample in tqdm.tqdm(samples):
        predicted_answer = infer_llm(sample)
        predictions.append(predicted_answer)

        if predicted_answer == sample["answerKey"]:
            correct += 1
    e = time.time()
    print("Time:",e - s)
    accuracy = correct / total * 100
    return predictions, accuracy

# Evaluate the samples
predictions, accuracy = evaluate_samples(grouped_datasets['piqa'])

100%|██████████| 696/696 [06:28<00:00,  1.79it/s]

Time: 388.3427379131317





In [None]:
# Print results
for i, sample in enumerate(grouped_datasets['piqa']):
    print(f"Question: {sample['question']}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"Correct Answer: {sample['answerKey']}\n")

print(f"Accuracy: {accuracy:.2f}%")

Question: How do I ready a guinea pig cage for it's new occupants?
Predicted Answer: B
Correct Answer: A

Question: Remove soap scum from shower door.
Predicted Answer: B
Correct Answer: B

Question: Neatly wrap up an extension cord.
Predicted Answer: B
Correct Answer: A

Question: how do you put eyelashes on?
Predicted Answer: B
Correct Answer: B

Question: How to clean blinds without tearing them up
Predicted Answer: B
Correct Answer: A

Question: What material is a steel rocking chair made out of?
Predicted Answer: A
Correct Answer: A

Question: To cream butter and sugar together, you can
Predicted Answer: B
Correct Answer: B

Question: How to best cut the meat to place on a grill?
Predicted Answer: B
Correct Answer: B

Question: What alcohol do you pour for a mojito?
Predicted Answer: B
Correct Answer: A

Question: How to start an automatic transmission car.
Predicted Answer: B
Correct Answer: A

Question: How to make sure all the clocks in the house are set accurately?
Predicted A
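
The `Time:` lines printed after each run allow a rough speed comparison between the two models. A sketch that converts them into samples per second. The sample counts and times are copied from the TinyLlama_v1.1 and Qwen2.5-3B-Instruct outputs above, with times rounded to two decimals; the dictionary layout and the helper name `throughput` are only an illustration:

```python
# (samples, seconds) pairs copied from the tqdm and Time outputs, times rounded
runs = {
    ("TinyLlama_v1.1", "piqa"): (696, 211.51),
    ("TinyLlama_v1.1", "OpenbookQA"): (491, 137.66),
    ("TinyLlama_v1.1", "CommonsenseQA"): (710, 200.67),
    ("Qwen2.5-3B-Instruct", "CommonsenseQA"): (710, 376.21),
    ("Qwen2.5-3B-Instruct", "OpenbookQA"): (491, 266.06),
    ("Qwen2.5-3B-Instruct", "piqa"): (696, 388.34),
}

def throughput(num_samples, seconds):
    """Samples processed per second."""
    return num_samples / seconds

for (model_name, dataset), (n, secs) in runs.items():
    print(f"{model_name:22s} {dataset:14s} {throughput(n, secs):.2f} samples/s")
```

The two models were run with different `max_new_tokens` settings and prompt formats, so these figures compare whole pipelines rather than the models alone.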