<a href="https://colab.research.google.com/github/swilli21/NLP/blob/master/LLM_Benchmark_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Benchmark
Benchmark datasets are standardized data collections that measure and compare LLM performance across specific capabilities. They typically include input-output pairs and evaluation metrics. Some common benchmark categories:

- Language understanding (GLUE, SuperGLUE): tests tasks like sentiment analysis and question answering
- Knowledge and reasoning (MMLU, BigBench): evaluates factual knowledge and logical reasoning
- Safety and alignment (TruthfulQA, Anthropic’s Constitutional AI evaluations): measures model truthfulness and alignment with intended behavior
- Coding (HumanEval, MBPP): tests code generation and problem-solving abilities

In this lesson, we will use the popular Hugging Face Datasets library to work with benchmark datasets.

# The Hugging Face Datasets library
The Hugging Face Datasets library has emerged as a cornerstone tool for accessing and working with benchmark datasets in natural language processing (NLP). This Python-based library provides a unified interface to hundreds of publicly available datasets, making standardized evaluation of language models more accessible and reproducible. The key features and benefits of the library are:

# 1. Centralized Dataset Access
The datasets library offers one-line access to a vast collection of NLP benchmark datasets:
```py
from datasets import load_dataset

# Load a sentiment analysis benchmark
sst2_dataset = load_dataset("glue", "sst2")

# Load a question answering benchmark
squad_dataset = load_dataset("squad")

# Load a general knowledge benchmark
mmlu_dataset = load_dataset("cais/mmlu", "high_school_biology")

```

This unified API eliminates the need to handle different data formats, download procedures, and preprocessing steps across diverse benchmarks.

# 2. Efficient Data Handling
The library is built on Apache Arrow, providing:

- Memory efficiency: Datasets don’t need to be fully loaded into RAM.
- Fast processing: optimized for quick filtering, mapping, and batch operations
- Lazy loading: Data is only processed when needed.

These features are critical when working with large benchmark datasets that might otherwise be unwieldy to manage.

# 3. Preprocessing and Transformation
Datasets provide powerful tools for preparing benchmark data for model evaluation:
```py
# Assumes `dataset` is a loaded split with a "text" column
# and `tokenizer` is a Hugging Face tokenizer.

# Apply tokenization to an entire dataset at once
tokenized_dataset = dataset.map(
    lambda examples: tokenizer(examples["text"], padding="max_length", truncation=True),
    batched=True
)

# Filter dataset to focus on specific examples
filtered_dataset = dataset.filter(lambda example: len(example["text"]) > 100)

# Create train/validation splits if needed
split_dataset = dataset.train_test_split(test_size=0.2)
```

These operations are executed efficiently and maintain the dataset structure throughout transformations.

# 4. Integrated Metrics
Benchmark metrics are provided by the companion Hugging Face Evaluate library. Older releases of datasets exposed them through load_metric, which has since been deprecated and removed in favor of evaluate.load:
```py
import evaluate

# Load the appropriate metric for the benchmark
metric = evaluate.load("glue", "sst2")   # For sentiment analysis (reports accuracy)
# OR
# metric = evaluate.load("squad")        # For question answering (reports exact_match and f1)

# Add predictions and references
metric.add_batch(predictions=predictions, references=references)

# Compute final scores (the 'accuracy' key applies to the GLUE SST-2 metric)
results = metric.compute()
print(f"Accuracy: {results['accuracy']}")
```

These metrics are tailored to specific benchmarks, ensuring standardized evaluation aligned with published leaderboards. The Hugging Face Datasets library democratizes access to benchmark datasets, making standardized evaluation accessible to researchers and practitioners of all levels. It has become an indispensable tool in the modern NLP evaluation toolkit by providing efficient data handling, unified access patterns, and integrated metrics.

The library significantly reduces the friction of working with diverse benchmarks for language model evaluation, allowing researchers to focus on model improvements rather than data engineering. As benchmark datasets evolve and expand, the datasets library provides a flexible foundation that grows with the field’s needs. We’re ready to implement these steps with different language tasks and evaluate LLMs appropriately!

# Sentiment Analysis Benchmark Datasets
#### What is Sentiment Analysis?
Sentiment analysis is a natural language processing (NLP) technique that identifies and extracts subjective information from text to determine whether the writer’s attitude is positive, negative, or neutral. It’s a widely used NLP technique for analyzing text data in various domains, such as customer feedback, social media conversations, product reviews, etc.

# How do Language Models Implement Sentiment Analysis?
Language models implement sentiment analysis through several mechanisms that leverage their understanding of text using the following strategies:

### 1. Finetuning:

A pre-trained language model would need to be fine-tuned on a sentiment analysis task. The steps involved might look like:

- Base model selection: Start with models like BERT, RoBERTa, T5, or GPT.
- Dataset preparation: Collect labelled sentiment data (positive, negative, neutral).
- Finetuning process: Train the model on sentiment data while preserving general language understanding.
- Classification layer: Add a task-specific layer on top that outputs sentiment probabilities.

### 2. Zero/Few-shot Learning:

Instead of updating model weights, this strategy steers a pre-trained (or already fine-tuned) model through its prompt. It involves the following steps:

- Prompt engineering: Frame sentiment analysis as a completion task.
- Examples: Include demonstrations of sentiment classification in context.
- Output formatting: Structure the prompt to guide the model toward consistent response formats.

As we can see, large text datasets with accurately labelled sentiment classes are crucial for both strategies: they supply training data for fine-tuning and demonstrations for few-shot prompts. Benchmark datasets are designed exactly for that!
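The prompt-based steps above can be sketched in a few lines of plain Python. This is only an illustration: `build_few_shot_prompt` is a hypothetical helper, and the instruction wording and the `Review:` / `Sentiment:` layout are assumptions, not a format required by any particular model.

```python
# Hypothetical helper: formats labelled demonstrations plus one unlabelled query.
def build_few_shot_prompt(examples, query):
    """Return a text prompt that frames sentiment analysis as a completion task."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # Leave the final label blank so the model completes it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Demonstrations taken from sentences shown elsewhere in this lesson.
demos = [
    ("contains no wit , only labored gags", "negative"),
    ("it's a charming and often affecting journey.", "positive"),
]
prompt = build_few_shot_prompt(demos, "unflinchingly bleak and desperate")
print(prompt)
```

Passing an empty list of demonstrations turns the same helper into a zero-shot prompt; keeping the layout identical across examples is what nudges the model toward a consistent output format.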

# Benchmark Datasets for Sentiment Analysis
Here’s an overview of some of the most widely used datasets in the field:

| Text Reviews Dataset | Overview | Size | Labels |
|---|---|---|---|
| SST (Stanford Sentiment Treebank) | Derived from movie reviews with fine-grained sentiment labels | ~11,800 sentences | 5-class sentiment scale (very negative to very positive) |
| IMDB Movie Reviews | Large dataset of movie reviews from IMDB | 50,000 reviews | Binary (positive/negative) |


| Social Media Dataset | Overview | Size | Labels |
|---|---|---|---|
| Twitter Sentiment Analysis Dataset (Sentiment140) | Collection of tweets with automatically annotated sentiment | 1.6 million tweets | Binary (positive/negative) or neutral in some versions |
| SemEval Twitter Sentiment Analysis datasets | Series of datasets released for SemEval competitions | Varies by year, typically thousands of tweets | 3-class (positive, negative, neutral) or fine-grained|


| Multilingual Dataset | Overview | Size | Labels |
|---|---|---|---|
| XNLI (Cross-lingual Natural Language Inference) | While primarily for NLI, it’s widely used for cross-lingual sentiment | ~7,500 sentence pairs per language | Three classes (entailment, contradiction, neutral) |
| MELD (Multimodal EmotionLines Dataset) | Multimodal dataset from TV show “Friends” | ~13,000 utterances | 7 emotions and sentiment classes |

Sentiment analysis benchmark datasets can also be highly domain-specific, allowing for the construction of models catering to the nuances of different industries (like finance, for instance). Additionally, aspect-based datasets can determine a text’s sentiment with respect to a specific aspect.



In this lesson, we will work with SST-2, the widely used binary version of the Stanford Sentiment Treebank. The dataset card can be found on the Hugging Face website here: https://huggingface.co/datasets/stanfordnlp/sst2. We will load the dataset through the GLUE benchmark, a collection of datasets covering a variety of language understanding tasks. The model we will evaluate for sentiment analysis is DistilBERT, a smaller, pre-trained version of BERT. Run the setup cells to import the relevant packages, load the dataset and model, and take a look at the dataset.


1. Tokenization:

We’ve defined a tokenization function to convert text into tokens that the model can understand. Apply this function to the dataset using .map() and set the argument batched to True.

The dataset format is set to return PyTorch tensors, with the necessary columns specified.


2. Predict Outcomes on the Validation Sample:

Check whether the model predicts correctly on unseen data using a small sample from the validation set. The necessary inputs, input_ids and attention_mask, are prepared, and the labels are gathered for later comparison.

torch.no_grad() is used to disable gradient calculation during prediction, which reduces memory usage and computation.

- Pass inputs to the model to generate outputs, which contain the raw predictions (logits).
- Apply the softmax() function to these logits using torch.nn.functional.softmax, setting dim=-1 so that probabilities are calculated across the class dimension. Store these probabilities in predictions.
- Determine the most likely class labels by finding the index with the highest probability in predictions. Store these in predicted_labels.
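To make the softmax and argmax steps concrete, here is a minimal pure-Python sketch of what they compute for each row of logits. The logits are made up for illustration, and the helper functions are stand-ins written for this example, not the PyTorch implementations.

```python
import math


def softmax(logits):
    """Convert one row of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up logits for three sentences; columns are [negative, positive].
batch_logits = [[-2.0, 3.0], [1.5, -0.5], [0.2, 0.1]]

for logits in batch_logits:
    probs = softmax(logits)
    print([round(p, 3) for p in probs], "-> class", argmax(probs))
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits gives the same label as taking the argmax of the probabilities; the softmax step matters when you want to inspect confidence values.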


3. Decoding and Displaying Predictions:

We’re now ready to convert the model’s input data from tokenized IDs back to readable text and display the predictions alongside actual sentiments.

Decode each sequence in input_ids into readable sentences by applying the tokenizer’s .decode() method with skip_special_tokens enabled, to recover the original text and simplify the interpretation of the model’s inputs.

Create a DataFrame results_df to display:

- sentences for the readable text
- predicted_labels for the model’s predicted sentiments
- labels for the actual sentiments

4. Calculate Precision and Recall:

Now that we can see our model’s predictions, let’s evaluate its performance using metrics from scikit-learn to understand how effectively it classifies sentiment.

Calculate precision and recall using the scikit-learn metrics functions with your actual and predicted sentiment values from results_df. Ensure you configure weighted averaging in both calculations.

**Load the SST-2 dataset from the GLUE benchmark for sentiment analysis, and understand its content of movie reviews labeled as positive (1) or negative (0).**

In [None]:
# Import necessary libraries
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
import warnings
warnings.filterwarnings("ignore")

# Load the SST-2 dataset
dataset = load_dataset("glue", "sst2")

# Convert the training and validation datasets to Pandas DataFrames
train_df = pd.DataFrame(dataset['train'])
validation_df = pd.DataFrame(dataset['validation'])
test_df = pd.DataFrame(dataset['test'])

# Display the first few rows of the training datasets
print("\nTraining Set Sample:")
print(train_df.head())

# (Optional) Use `.head()` to also take a look at the validation and test datasets!


Training Set Sample:
                                            sentence  label  idx
0       hide new secretions from the parental units       0    0
1               contains no wit , only labored gags       0    1
2  that loves its characters and communicates som...      1    2
3  remains utterly satisfied to remain the same t...      0    3
4  on the worst revenge-of-the-nerds clichés the ...      0    4


**SST-2 (from the Stanford Sentiment Treebank) is a dataset of movie review sentences labeled as positive (1) or negative (0). The data is organized into a training set of 67,349 samples, a validation set of 872 samples, and a test set of 1,821 samples. The labels are hidden in the test set so that it can serve as an unbiased final evaluation, which is why we evaluate on the validation set here.**

- Loading the model

The model we will be evaluating is a pretrained DistilBERT model, a lighter, faster version of BERT with similar performance.

In [None]:
#load the model
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)



Use .map() to apply tokenize_function to the dataset, ensuring batched=True to process multiple samples at once.

In [None]:
# Tokenization
def tokenize_function(examples):
    return tokenizer(examples['sentence'], truncation=True, padding='max_length', max_length=128)

# Apply tokenization to the entire dataset
dataset = dataset.map(tokenize_function, batched=True)

dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

- Use model(**inputs) to generate outputs from the model.

- Convert outputs.logits into probabilities using a function that applies the softmax operation across the last dimension (dim=-1).

- Retrieve the most probable class label from predictions by selecting the index with the highest probability.

In [None]:
# Select a sample from the validation set
# Temporarily remove the torch format to get raw data
validation_dataset = dataset['validation']
original_format = validation_dataset.format
validation_dataset.set_format(type=None)
sample_data_raw = validation_dataset[:5]

# Restore original format to the dataset
validation_dataset.set_format(**original_format)

# Now convert the raw data to torch tensors for model input
inputs = {
    'input_ids': torch.tensor(sample_data_raw['input_ids']),
    'attention_mask': torch.tensor(sample_data_raw['attention_mask'])
}
labels = sample_data_raw['label']

# Perform prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_labels = predictions.argmax(dim=-1).tolist()

In [None]:
!rm -rf ~/.cache/huggingface/datasets/
print("Hugging Face datasets cache cleared. Please re-run the load_dataset cell (JlCbfCeMc91B).")

Hugging Face datasets cache cleared. Please re-run the load_dataset cell (JlCbfCeMc91B).


- Use tokenizer.decode() inside a list comprehension to process inputs['input_ids'], converting tokenized sequences into readable text with skip_special_tokens set to True to remove special tokens.

- Create a pd.DataFrame() storing sentences in 'Sentence', predicted_labels in 'Predicted Sentiment', and labels in 'Actual Sentiment'.

We’re using a small sample from the validation set to stay within memory and compute limits. This lets us quickly check the model’s predictions without the resource requirements of processing the full dataset. Feel free to try larger samples locally or in Google Colab.


In [None]:
# Decoding and Displaying Predictions
# Decode the sentences from the tokenizer
sentences = [tokenizer.decode(input_ids, skip_special_tokens=True) for input_ids in inputs['input_ids']]

# Create a DataFrame to display the sentences, predicted sentiments, and actual sentiments
results_df = pd.DataFrame({
    'Sentence': sentences,
    'Predicted Sentiment': predicted_labels,
    'Actual Sentiment': labels
})

# Print the DataFrame
print("\nPredictions and Actual Labels:")
results_df


Predictions and Actual Labels:


| | Sentence | Predicted Sentiment | Actual Sentiment |
|---|---|---|---|
| 0 | it's a charming and often affecting journey. | 1 | 1 |
| 1 | unflinchingly bleak and desperate | 0 | 0 |
| 2 | allows us to hope that nolan is poised to emba... | 1 | 1 |
| 3 | the acting, costumes, music, cinematography an... | 1 | 1 |
| 4 | it's slow - - very, very slow. | 0 | 0 |


The table above shows a sample of prediction results:

- The Predicted Sentiment column contains the model's classification (0 = Negative, 1 = Positive).
- The Actual Sentiment column contains the correct labels from the dataset.
- If Predicted Sentiment ≠ Actual Sentiment, the model has made an incorrect prediction.
- In this sample, the model correctly classified all five sentences, covering both positive (1) and negative (0) sentiments.
- Five examples are too few to draw conclusions about bias or overall accuracy; they only show that the pipeline works end to end.


- Use precision_score() and recall_score() from scikit-learn to calculate precision and recall.

- Pass 'Actual Sentiment' and 'Predicted Sentiment' values from results_df as inputs.

- Set average to 'weighted' to account for class imbalances in the dataset.

In [None]:
# Calculate Precision and Recall
# Import metrics from scikit-learn
from sklearn.metrics import precision_score, recall_score

# Calculate precision and recall
precision = precision_score(results_df['Actual Sentiment'], results_df['Predicted Sentiment'], average='weighted')
recall = recall_score(results_df['Actual Sentiment'], results_df['Predicted Sentiment'], average='weighted')

# Print metrics
print(f"Model Precision: {precision:.4f}")
print(f"Model Recall: {recall:.4f}")

Model Precision: 1.0000
Model Recall: 1.0000


- Precision indicates how many of the predictions for a class were actually correct.
- Recall indicates how many of the actual members of a class were correctly identified.
- With weighted averaging, the per-class scores are averaged in proportion to how often each class appears.
- The scores of 1.0 here indicate that the model correctly classified every sentence in our small sample.
- In the next exercise, we will evaluate the model on the full validation set using the benchmark's own metric.
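As a rough illustration of what weighted averaging does, the sketch below recomputes weighted precision and recall by hand in plain Python. The labels are made up (they are not the notebook's results), and `weighted_precision_recall` is a hypothetical helper written for this example rather than scikit-learn's implementation.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Per-class precision and recall, averaged with weights equal to class support."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = 0.0
    for cls, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        cls_precision = true_pos / predicted if predicted else 0.0
        cls_recall = true_pos / count
        weight = count / n  # share of the samples that belong to this class
        precision += weight * cls_precision
        recall += weight * cls_recall
    return precision, recall


# Made-up labels: three positives, two negatives, one positive misclassified.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 0]

precision, recall = weighted_precision_recall(actual, predicted)
print(f"Weighted precision: {precision:.4f}")  # 0.8667
print(f"Weighted recall: {recall:.4f}")        # 0.8000
```

Note that weighted recall works out to the same value as plain accuracy (4 of 5 correct here), since each class's recall is weighted by its share of the samples.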

# Evaluating Sentiment Analysis
Now, we’re ready to explore the evaluation of language models on sentiment analysis using benchmark datasets. Sentiment analysis determines whether text expresses positive or negative emotions — a fundamental NLP capability used in applications from social media monitoring to customer feedback analysis. Some practical considerations to keep in mind for analysis in real-world scenarios are:

- Class imbalance: Ensure dominant classes don’t skew metrics.
- Human baseline: Compare with human annotator agreement rates.
- Error analysis: Identify systematic failures for targeted improvements.
- Domain adaptation: Test performance across different domains/contexts.

Effective sentiment analysis models typically require linguistic knowledge, domain adaptation, and sophisticated algorithms to handle nuances like sarcasm, idioms, and context-dependent polarity.
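The class imbalance point can be illustrated with a small sketch. The data below is invented (90 positive and 10 negative labels), and the "model" simply predicts positive every time; the two helper functions are written for this example only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Average of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in indices) / len(indices))
    return sum(recalls) / len(recalls)


# Made-up imbalanced data: 90 positive and 10 negative reviews.
labels = [1] * 90 + [0] * 10
always_positive = [1] * 100  # a "model" that ignores its input

print(accuracy(labels, always_positive))      # 0.9
print(macro_recall(labels, always_positive))  # 0.5
```

The 90% accuracy looks strong even though no negative review is ever detected, which is why it helps to report a per-class or macro-averaged metric alongside accuracy when classes are unevenly represented.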

## Dataset, Model and Tokenization
Recall that the Stanford Sentiment Treebank (SST-2) from the GLUE benchmark provides:

- Movie review sentences with binary sentiment labels (positive/negative)
- Standardized splits for training, validation, and testing
- A common reference point for comparing model performance

We’re going to use the pre-trained model DistilBERT, which is:

- A lightweight, efficient version of BERT
- Pre-trained on general language understanding
- Fine-tuned specifically for sentiment analysis

To process text, models require tokenization, which is the process of:

- Converting words into numerical token IDs
- Adding special tokens and padding
- Creating attention masks to handle variable-length inputs
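The three tokenization steps can be sketched with a toy whitespace tokenizer. Everything here is an assumption made for illustration: the vocabulary and its IDs are invented, and `toy_encode` is a hypothetical function; real tokenizers such as DistilBERT's use subword vocabularies with tens of thousands of entries.

```python
# Invented vocabulary; the IDs do not correspond to any real tokenizer.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "a": 4, "charming": 5, "journey": 6, "very": 7, "slow": 8}


def toy_encode(sentence, max_length=8):
    """Map words to IDs, add special tokens, truncate, then pad to max_length."""
    tokens = sentence.lower().split()
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    ids = ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * padding,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


encoded = toy_encode("a charming journey")
print(encoded["input_ids"])       # [2, 4, 5, 6, 3, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Padding every sentence to the same length lets a batch be stacked into one tensor, and the attention mask tells the model which positions hold actual text.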

For SST-2, accuracy serves as the primary metric and it:

- Measures the percentage of correctly classified sentiments
- Is appropriate for a balanced binary classification task

### Implementation Process
After loading the model and dataset and implementing tokenization, we’re ready to begin the evaluation process. The evaluation process requires the following steps.

### 1. Converting model outputs (logits) to predictions
We’ve defined a function that compares predictions with ground truth labels:
```py
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

```

### 2. Running Evaluation
The Hugging Face Trainer simplifies the evaluation workflow:
```py
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics
)

```

### 3. Evaluation
The evaluation process only requires a couple of lines of code now!
```py
results = trainer.evaluate()
accuracy = results["eval_accuracy"]

```

A strong performance for a lightweight model indicates successful sentiment classification in most cases. This evaluation approach enables comparing different models for optimal selection, tracking improvements during experimentation, ensuring models meet quality thresholds before deployment, and contributing comparable results to research.
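To show how the pieces fit together without downloading a model, here is a library-free sketch of the same flow: logits go in, an accuracy dictionary comes out, and the metric names are prefixed with `eval_` in the way `trainer.evaluate()` reports them by default. The functions `toy_compute_metrics` and `toy_evaluate` are hypothetical stand-ins, and the logits and labels are made up.

```python
def toy_compute_metrics(eval_pred):
    """Same shape as compute_metrics: a (logits, labels) pair in, a dict out."""
    logits, labels = eval_pred
    # Pick the index of the highest score in each row (argmax over classes).
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


def toy_evaluate(logits, labels, prefix="eval"):
    """Hypothetical stand-in for trainer.evaluate(): prefixes each metric name."""
    metrics = toy_compute_metrics((logits, labels))
    return {f"{prefix}_{name}": value for name, value in metrics.items()}


# Made-up logits for four sentences; columns are [negative, positive].
logits = [[2.1, -1.0], [-0.3, 0.8], [0.5, 0.4], [-1.2, 2.2]]
labels = [0, 1, 1, 1]

results = toy_evaluate(logits, labels)
print(results)  # {'eval_accuracy': 0.75}
```

This is why the accuracy is read from the `"eval_accuracy"` key rather than `"accuracy"` once the Trainer is involved.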


1. Compute Evaluation Metrics:

Now that we have the model’s predictions, let’s compute evaluation metrics to assess its performance:

- Extract logits and labels from the evaluation dataset to separate model outputs from actual labels.

- Use np.argmax() on logits along the correct axis to determine the predicted class for each input.

- Compute the evaluation metrics by passing predictions and labels as references into the metric computation function.


2. Initialize Trainer for Evaluation:

- Set up the trainer to evaluate model performance.

- Evaluation settings are already defined in training_args, with a batch size of 64, evaluation enabled, and external logging disabled.

- Initialize trainer to manage the evaluation process:
  - Assign the pre-trained model to model.
  - Use args to apply training_args, ensuring the evaluation settings are followed.
  - Set eval_dataset to the validation dataset for assessment.
  - Attach compute_metrics to calculate model performance.

3. Evaluate the Model:

Now that we have set up the trainer, let’s evaluate the model using the validation dataset and extract the accuracy score:

- Call evaluate() on trainer to compute evaluation metrics on the validation dataset.

- Extract the "eval_accuracy" score from the evaluation results and store it in accuracy.

In [None]:
# Import Required Libraries
import transformers
import pandas as pd
import numpy as np
from datasets import load_dataset
from evaluate import load
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# load dataset
dataset = load_dataset("glue", "sst2")

# Convert the training and validation datasets to Pandas DataFrames
train_df = pd.DataFrame(dataset['train'])
validation_df = pd.DataFrame(dataset['validation'])
test_df = pd.DataFrame(dataset['test'])

- Use np.argmax() to select the class index with the highest confidence score from logits.

- Ensure you pass predictions and labels to the metric computation function to evaluate performance.

In [None]:
# Loading the model and the tokenizer
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], truncation=True, padding="max_length",max_length=128)

# Tokenize the dataset in batches and set format to return PyTorch tensors
dataset = dataset.map(tokenize_function, batched=True)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
metric = load("glue", "sst2")
print("Loaded metric details:\n\n", metric)

Loaded metric details:

 EvaluationModule(name: "glue", module_type: "metric", features: {'predictions': Value('int64'), 'references': Value('int64')}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = evaluate.load('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=referenc

For the SST-2 task from the GLUE benchmark, accuracy is the primary evaluation metric, reflecting the percentage of sentiments correctly classified. This focus on accuracy is suitable for this balanced binary classification task, where both positive and negative classes are well-represented.

In [None]:
###  Compute Evaluation Metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
### Initialize Trainer for Evaluation
training_args = TrainingArguments(
    output_dir='./results',
    do_eval=True,
    per_device_eval_batch_size=64,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics
)

In [None]:
### Evaluate Model and Extract Accuracy

# Evaluate the model
evaluation_results = trainer.evaluate()
# Extract and print the accuracy score
accuracy = evaluation_results.get("eval_accuracy")
print(f"GLUE Accuracy: {accuracy:.4f}")



GLUE Accuracy: 0.9106


The fine-tuned DistilBERT model achieved 91.06% accuracy on the SST-2 validation dataset from the GLUE benchmark.

- The model correctly classified 91.06% of the sentiment analysis samples.
- SST-2 is a binary classification task, meaning the model predicts whether a sentence has a positive (1) or negative (0) sentiment.
- 91.06% accuracy is a strong result, especially for a lightweight model like DistilBERT, meaning it has learned to classify sentiments quite well.

# Question and Answer Benchmark Datasets
Question and Answer (Q&A) benchmark datasets are essential for developing and evaluating models that can understand and respond to natural language questions. Here’s an overview of the most significant Q&A benchmarks in the field:

| Open-Domain Q&A Datasets | Overview | Size | Notes |
|---|---|---|---|
| Natural Questions (NQ) |	Real Google search queries with answers from Wikipedia	| ~300,000 questions | Both short answers (entities, dates) and long answers (paragraphs) |
|TriviaQA	| Trivia questions from various websites with evidence documents	| ~95,000 question-answer pairs	| Questions created independently of the evidence documents |

| Reading Comprehension Datasets | Overview | Size | Notes |
|---|---|---|---|
|SQuAD (Stanford Question Answering Dataset)	| Questions on Wikipedia articles with answers as text spans	| SQuAD 1.1: ~100,000 questions; SQuAD 2.0: ~150,000 questions	| SQuAD 2.0 includes unanswerable questions|
|NewsQA	|Questions based on CNN news articles	| ~120,000 question-answer pairs	| Complex questions requiring reasoning|

| Commonsense Reasoning Datasets | Overview | Size | Notes |
|---|---|---|---|
| CommonsenseQA	| Questions requiring commonsense knowledge	| ~12,000 questions	| Multiple-choice format with challenging distractors|
|PIQA (Physical Interaction QA)	| Questions about physical commonsense knowledge	| ~16,000 questions	| Binary choice questions about everyday physical scenarios|

## Evaluation Considerations
When selecting a QA benchmark, researchers typically consider:

- Task alignment: whether the dataset matches the intended application
- Answer type: extractive (span selection) vs. abstractive (generated) answers
- Reasoning complexity: simple fact retrieval vs. multi-hop or commonsense reasoning
- Linguistic diversity: single language vs. multilingual capabilities
- Domain coverage: general knowledge vs. specialized domains

Each benchmark tests different aspects of question answering systems, and performance across multiple benchmarks provides a more comprehensive evaluation of a model’s capabilities.

In this exercise, we will use SQuAD 1.1 to evaluate the performance of DistilBERT on Q&A tasks. SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset widely used to train and evaluate extractive question answering models. It includes a training set of 87,599 samples and a validation set of 10,570 samples, each with a question, a context, and one or more answers.


### 1. Initialize Tokenizer and QA Pipeline:

Configure the tokenizer and qa_pipeline for 'question-answering':

- The model is already initialized for question answering using DistilBERT.
- Load the corresponding tokenizer to preprocess input text.
- Set up qa_pipeline with 'question-answering', linking it to model and tokenizer for processing and answering questions.

### 2. Generate Predictions on a Sample:

- Use qa_pipeline to generate predictions for a small sample from validation_df, collecting the prediction details in results, which starts as an empty list.

- Iterate through sample_validation, extracting context and question for each row.

- Pass the extracted question and context into qa_pipeline to generate answer, which includes the predicted response and confidence score.

- Append a dictionary to results for each output, containing the Question, truncated Context (200 characters), Predicted Answer, and Score.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering, pipeline

# Load Dataset
dataset = load_dataset("squad")
print("\nDataset Structure:\n\n",dataset)

# Convert to DataFrames, keeping the first reference answer for each question
train_df = pd.DataFrame({
    'question': dataset['train']['question'],
    'context': dataset['train']['context'],
    'answer': [ans['text'][0] for ans in dataset['train']['answers']]
})

validation_df = pd.DataFrame({
    'question': dataset['validation']['question'],
    'context': dataset['validation']['context'],
    'answer': [ans['text'][0] for ans in dataset['validation']['answers']]
})

print("\nTraining Set Sample:")
train_df.head()


Dataset Structure:

 DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Training Set Sample:


| | question | context | answer |
|---|---|---|---|
| 0 | To whom did the Virgin Mary allegedly appear i... | Architecturally, the school has a Catholic cha... | Saint Bernadette Soubirous |
| 1 | What is in front of the Notre Dame Main Building? | Architecturally, the school has a Catholic cha... | a copper statue of Christ |
| 2 | The Basilica of the Sacred heart at Notre Dame... | Architecturally, the school has a Catholic cha... | the Main Building |
| 3 | What is the Grotto at Notre Dame? | Architecturally, the school has a Catholic cha... | a Marian place of prayer and reflection |
| 4 | What sits on top of the Main Building at Notre... | Architecturally, the school has a Catholic cha... | a golden statue of the Virgin Mary |


In [None]:
### Initialize Tokenizer and QA Pipeline
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')
# Setup the QA pipeline
qa_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)

Device set to use cpu


In [None]:
### Generate Predictions on a Sample
# Use a small sample from the validation dataset for a quick demonstration
sample_validation = validation_df.sample(n=5, random_state=1)

# Generate predictions for the sample
results = []

for _, row in sample_validation.iterrows():
    context = row['context']
    question = row['question']
    answer = qa_pipeline(question=question, context=context)
    results.append({
        'Question': question,
        'Context': context[:200],
        'Predicted Answer': answer['answer'],
        'Score': answer['score']
    })

# Display the results
results_df = pd.DataFrame(results)
results_df



Unnamed: 0,Question,Context,Predicted Answer,Score
0,How is lap provided by overlapping the admissi...,"In the 1840s and 50s, there were attempts to o...",lengthening rubbing surfaces of the valve,0.435091
1,Who started at tight end for the Panthers?,"The Panthers offense, which led the NFL in sco...","Greg Olsen,",0.992418
2,"Besides 1,3 and 7, what other number must all ...","Hence, 6 is not prime. The image at the right ...",9,0.961408
3,How do academic results in former Model C scho...,"In the final years of the apartheid era, paren...",better,0.388785
4,IP and AM are most commonly defined by what ty...,Other important complexity classes include BPP...,Interactive,0.937586


### Model Performance Analysis

- High-confidence predictions are short, specific spans.
Example: Who started at tight end for the Panthers? → Predicted: Greg Olsen (Score: 0.99)
The model is most confident on direct, fact-based questions, and the three predictions scoring above 0.9 all look plausible.

- Low-confidence predictions tend to be incomplete or vague.
Example: How do academic results in former Model C schools compare? → Predicted: better (Score: 0.39)
The model struggles with open-ended or comparative questions.

- Confidence scores are a rough guide to reliability.
Higher scores usually accompany more accurate predictions, but this cell does not compare predictions with the reference answers, so correctness is not yet verified.
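One way to act on the confidence scores is to separate predictions at a cutoff. The sketch below reuses the five predicted answers and scores from the table above; the function name `split_by_confidence` and the 0.5 threshold are illustrative assumptions, not values recommended by the lesson.

```python
# Predicted answers and scores copied from the sample output above.
sample_results = [
    {"answer": "lengthening rubbing surfaces of the valve", "score": 0.435091},
    {"answer": "Greg Olsen,", "score": 0.992418},
    {"answer": "9", "score": 0.961408},
    {"answer": "better", "score": 0.388785},
    {"answer": "Interactive", "score": 0.937586},
]


def split_by_confidence(rows, threshold=0.5):
    """Split predictions into (confident, uncertain) lists at a score cutoff.

    The default threshold is a hypothetical value chosen for illustration.
    """
    confident = [r for r in rows if r["score"] >= threshold]
    uncertain = [r for r in rows if r["score"] < threshold]
    return confident, uncertain


confident, uncertain = split_by_confidence(sample_results)
print("Confident:", [r["answer"] for r in confident])
print("Needs review:", [r["answer"] for r in uncertain])
```

A cutoff like this only flags which answers deserve a second look; whether the confident answers are actually right still has to be checked against the reference answers.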


#### Next

- Evaluate the model on the entire validation dataset using benchmarking metrics to quantify performance.
- If confidence scores are low, consider fine-tuning the model for improved results.

# Evaluating LLMs on Q&A tasks

This exercise explores how to evaluate language models on Q&A tasks using the Stanford Question Answering Dataset (SQuAD). Question answering is a challenging NLP task in which models must extract specific answers from provided context passages, a capability essential for search engines, virtual assistants, and information retrieval systems. SQuAD is a benchmark dataset that contains:

- Questions posed by crowd workers
- Context paragraphs from Wikipedia
- Answers that are spans of text from the context
- Training and validation splits for model development and evaluation

Unlike classification tasks, QA evaluation requires measuring how well a model can locate and extract the exact answer text. Modern QA systems leverage pre-trained language models fine-tuned specifically for extractive question answering. DistilBERT for QA is a lightweight model trained to identify answer spans.

## QA-specific Evaluation Metrics

Two primary metrics are used to evaluate QA performance:

- Exact Match (EM): percentage of predictions that exactly match any of the ground truth answers
- F1 Score: measures the average overlap between prediction and ground truth at the token level

F1 Score is particularly important as it accounts for partial correctness when predictions contain most of the correct answer but miss some words or include extra words.
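Both metrics can be sketched in plain Python to show what they reward. The normalization used here (lowercasing, removing punctuation and the articles a/an/the) follows the approach commonly associated with the SQuAD evaluation script; the function names are illustrative and are not part of any library.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truths):
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gt) for gt in ground_truths))


def token_f1(prediction, ground_truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, ground_truths):
    """Score against every reference and keep the best overlap."""
    return max(token_f1(prediction, gt) for gt in ground_truths)


# A trailing comma does not break an exact match after normalization.
print(exact_match("Greg Olsen,", ["Greg Olsen"]))
# A truncated span earns partial credit under F1 but none under EM.
print(exact_match("a golden statue", ["a golden statue of the Virgin Mary"]))
print(round(best_f1("a golden statue", ["a golden statue of the Virgin Mary"]), 3))
```

In the last example the prediction keeps 2 of the 5 normalized reference tokens, so precision is 1.0, recall is 0.4, and F1 is about 0.571, while EM is 0.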

### Implementation Process

1. Setting up evaluation:
```py
dataset = load_dataset("squad")
model = AutoModelForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')

```

2. Prepare model for inference:
```py
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # Set to evaluation mode

```

3. Load evaluation metric:
```py
metric = evaluate.load('squad')  # needs "import evaluate"; replaces the removed datasets.load_metric

```


#### 1. Configure Model and Load Evaluation Metric:

Next, we prepare the model for evaluation and load the metric used to measure performance. The device is already set to GPU if available, or CPU otherwise, and the model has been moved to it. The remaining steps are:

- Switch the model to evaluation mode to disable training-related behaviors like dropout.
- Load the 'squad' metric with evaluate.load() from the Hugging Face evaluate library (the older load_metric function in datasets has been removed).

#### 2. Extract Answers Predicted by the Model:

The get_answer() function is defined to extract the model’s predicted answer. The input question and context have been tokenized, moved to the device, and passed through the model. Now extract the answer:

- Identify answer_start as the index with the highest confidence score by applying argmax() on outputs.start_logits.
- Determine answer_end using argmax() on outputs.end_logits, adding 1 to include the complete answer span.

- Extract answer by decoding the tokens between answer_start and answer_end from inputs['input_ids'] using tokenizer.decode().
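The span-selection logic can be tried on toy data without loading a model. In this sketch the tokens and logits are made-up values, plain lists stand in for tensors, and joining tokens with spaces stands in for tokenizer.decode(). The guard against an end index that falls before the start index is an added assumption, not part of the exercise.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def extract_span(start_logits, end_logits, tokens):
    """Greedy span selection: best start index, best end index (inclusive)."""
    answer_start = argmax(start_logits)
    answer_end = argmax(end_logits) + 1  # +1 so the slice includes the end token
    if answer_end <= answer_start:
        # Independent argmax calls can produce an invalid span; return empty.
        return ""
    return " ".join(tokens[answer_start:answer_end])


# Toy inputs (hypothetical values, for illustration only).
toy_tokens = ["[CLS]", "who", "started", "?", "[SEP]", "greg", "olsen", "started", "[SEP]"]
toy_start_logits = [0.1, 0.0, 0.2, 0.0, 0.1, 4.5, 1.0, 0.3, 0.1]
toy_end_logits = [0.1, 0.0, 0.1, 0.0, 0.2, 0.8, 5.1, 0.4, 0.1]

print(extract_span(toy_start_logits, toy_end_logits, toy_tokens))
```

Because the start and end positions are chosen independently, nothing forces the end to follow the start. More careful implementations score start/end pairs jointly; the greedy version matches the steps described above.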

#### 3. Format Predictions for Evaluation:

The function add_to_metric() formats and adds batches to the evaluation metric. Now we need to:

- Create a predictions list that includes a dictionary with the current index as id and the predicted answer as prediction_text.
- Create a references list containing a dictionary with the current index as id and answers containing a dictionary with:
1. text: a list with the correct answer from row['answer']
2. answer_start: a list containing [0]
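The expected structure is easier to see when built in isolation. The helper below is a hypothetical sketch (`build_metric_batch` is not a library function) that produces the two lists in the shape the 'squad' metric describes; accepting several gold answers is an extension of the single-answer case used in this exercise.

```python
def build_metric_batch(example_id, predicted_answer, gold_answers):
    """Build one-item predictions/references lists in SQuAD metric format."""
    predictions = [{"id": str(example_id), "prediction_text": predicted_answer}]
    references = [{
        "id": str(example_id),
        "answers": {
            "text": list(gold_answers),
            # Placeholder start positions: EM and F1 compare text only.
            "answer_start": [0] * len(gold_answers),
        },
    }]
    return predictions, references


predictions, references = build_metric_batch(
    7, "Saint Bernadette Soubirous", ["Saint Bernadette Soubirous"]
)
print(predictions)
print(references)
```

The id must be a string and must be identical in both lists, since the metric pairs each prediction with its reference by id.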

#### 4. Run Evaluation Loop and Compute Scores:

Now it’s time to put everything together and see how well our model performs! We’ll run the evaluation loop on a subset of the validation dataset under torch.no_grad() to disable gradient calculations for efficiency.

- For each row, call the get_answer() function to get the predicted answer and inputs.
- After processing all rows, compute the final_scores from the metric.


In [None]:
# Import necessary libraries
import evaluate
import pandas as pd
import torch
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering

# Load Dataset
dataset = load_dataset("squad")
print("\nDataset Structure:")
print(dataset)

# Convert to DataFrames
train_df = pd.DataFrame({
    'question': dataset['train']['question'],
    'context': dataset['train']['context'],
    'answer': [ans['text'][0] for ans in dataset['train']['answers']]
})

validation_df = pd.DataFrame({
    'question': dataset['validation']['question'],
    'context': dataset['validation']['context'],
    'answer': [ans['text'][0] for ans in dataset['validation']['answers']]
})

# Initialize the model and tokenizer for question answering
model = AutoModelForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')



Dataset Structure:
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [None]:
# Configure Model and Load Evaluation Metric
# Move model to GPU if available
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Set the model to evaluation mode
model.eval()

# Load the SQuAD evaluation metric
metric = evaluate.load('squad')

print("Loaded metric details:\n\n", metric)

Loaded metric details:

 EvaluationModule(name: "squad", module_type: "metric", features: {'predictions': {'id': Value('string'), 'prediction_text': Value('string')}, 'references': {'id': Value('string'), 'answers': {'text': List(Value('string')), 'answer_start': List(Value('int32'))}}}, usage: """
Computes SQuAD scores (F1 and EM).
Args:
    predictions: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair (see above),
        - 'answers': a Dict in the SQuAD dataset format
            {
                'text': list of possible texts for the answer, as a list of strings
                'answer_start': list of start positions for the answer, as a list of ints
            }
            Note that answer_sta

### SQuAD Evaluation Metric Loaded
The SQuAD evaluation metric has been downloaded and is ready to use.

- Metric type: squad (used for question answering tasks)
- What the metric needs:
  - Predictions: each prediction must have:
    - id: a unique ID for each question
    - prediction_text: the model's predicted answer
  - References (correct answers): each reference must have:
    - id: the same ID as the prediction, so the two can be matched
    - answers: a dictionary with text (a list of acceptable answer strings) and answer_start (a list of start positions)
- What the metric calculates:
  - Exact Match (EM): the percentage of predicted answers that exactly match a correct answer.
  - F1 Score: how much of the predicted answer overlaps with the correct answer, even when the match is not exact.

The metric can now score model performance once predictions and correct answers have been added.

In [None]:
# Extract Answers Predicted by the Model
def get_answer(row):
    # Tokenization and device handling
    inputs = tokenizer(row['question'], row['context'], return_tensors='pt', padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Pass inputs through model
    outputs = model(**inputs)

    ### YOUR SOLUTION HERE ###
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1
    answer = tokenizer.decode(inputs['input_ids'][0, answer_start:answer_end])
    return answer, inputs

In [None]:
# Format Predictions for Evaluation
def add_to_metric(i, answer, row):
    ### YOUR SOLUTION HERE ###
    predictions = [{'id': str(i), 'prediction_text': answer}]
    references = [{'id': str(i), 'answers': {'text': [row['answer']], 'answer_start': [0]}}]
    metric.add_batch(predictions=predictions, references=references)

In [None]:
# Run Evaluation Loop and Compute Scores
num_samples = 100
with torch.no_grad():
    for i, row in validation_df.head(num_samples).iterrows():
        ## YOUR SOLUTION HERE ##
        answer, inputs = get_answer(row)
        add_to_metric(i, answer, row)

## YOUR SOLUTION HERE ##
final_scores = metric.compute()
print(f"F1 Score: {final_scores['f1']:.2f}")
print(f"\nExact Match Score: {final_scores['exact_match']:.2f}")

F1 Score: 73.02

Exact Match Score: 65.00


### Model Performance Summary
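On the first 100 validation questions, the model reaches an F1 of 73.02 and an Exact Match of 65.00. EM is the stricter metric: 65 of the 100 predictions matched the reference answer after normalization.

The gap of about 8 points between F1 and EM comes from predictions that overlap with the reference only partially. Because the model is extractive, it cannot rephrase; these near misses are spans that are slightly longer or shorter than the reference. Note also that each prediction is compared against only the first reference answer, which can understate scores when other acceptable answers exist.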

### Potential Improvement
Using a larger model (e.g., bert-large-uncased-whole-word-masking-finetuned-squad) might improve performance.

# General Knowledge Benchmark Datasets

Now, we will explore how to evaluate language models on general knowledge tasks using benchmark datasets. General knowledge evaluation measures a model’s ability to recall facts, understand relationships between entities, and apply common-sense reasoning — capabilities essential for virtual assistants, educational tools, and information retrieval systems.

## General Knowledge Benchmark Datasets
Several benchmark datasets assess different aspects of general knowledge:

- TriviaQA: contains over 95,000 question-answer pairs authored by trivia enthusiasts, covering history, science, and popular culture
- Natural Questions (NQ): questions from real Google Search queries with answers from Wikipedia
- WebQuestions: real questions from web searches paired with Freebase entity answers
- MMLU (Massive Multitask Language Understanding): tests knowledge across 57 subjects from elementary to professional levels

These datasets provide standardized ways to measure how well models understand and recall factual information across diverse domains.

## Knowledge-Enhanced Language Models
Modern language models acquire general knowledge during pre-training:

- Knowledge acquisition: Models learn facts from their training data (web text, books, Wikipedia).
- Knowledge retrieval: During inference, models must recall and apply this knowledge.
- Domain coverage: Some models are specifically enhanced with structured knowledge bases.

The depth and accuracy of this knowledge vary across model architectures and training approaches.

## Evaluation Metrics for Knowledge Tasks

General knowledge evaluation uses specialized metrics:

- Accuracy: percentage of questions answered correctly (for multiple-choice formats)
- Exact Match (EM): for open-ended questions, measures if the prediction exactly matches the reference answer
- F1 Score: measures partial token overlap between prediction and reference answer
- Domain-specific performance: accuracy broken down by knowledge domains (science, history, etc.)
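Overall accuracy and the per-domain breakdown can be computed in a few lines. The records below are made-up toy data used only to show the arithmetic, and `accuracy_by_domain` is an illustrative name rather than a library function.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return (overall_accuracy, {domain: accuracy}) from (domain, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, is_correct in records:
        totals[domain] += 1
        hits[domain] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_domain = {domain: hits[domain] / totals[domain] for domain in totals}
    return overall, per_domain


# Toy records (hypothetical, not real benchmark results).
toy_records = [
    ("science", True),
    ("science", False),
    ("history", True),
    ("history", True),
]

overall, per_domain = accuracy_by_domain(toy_records)
print(f"Overall accuracy: {overall:.2f}")
for domain, acc in per_domain.items():
    print(f"{domain}: {acc:.2f}")
```

A single overall number can hide large differences between domains, which is why the breakdown is reported alongside it.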

## Practical Implications
Understanding general knowledge evaluation enables:

- Targeted model selection: Choose models with strengths in domains relevant to your application.
- Knowledge augmentation: Identify areas where additional facts or training might improve performance.
- Confidence thresholds: Set appropriate thresholds for when a model should admit uncertainty.
- Educational applications: Use domain-specific performance to guide the development of learning tools.

By mastering general knowledge evaluation, you can confidently deploy language models in applications where factual accuracy and breadth of knowledge are critical requirements. In this exercise, we will be working with the CAIS/MMLU dataset from Hugging Face (https://huggingface.co/datasets/cais/mmlu).

This dataset contains multiple-choice questions spanning a variety of subjects in the humanities and sciences. It covers 57 tasks, including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability. We will use a small, lightweight version of FLAN-T5 and see that it does not perform well at all. LLMs that consistently do well at general knowledge tasks are typically bigger, and we encourage you to experiment with heavier models on your local machine.


#### 1. Generate Predictions for Sample Questions:

The generate_prediction() function is defined to generate model predictions for each sample question. We’ve formatted and tokenized the input text, moved it to the device, and disabled gradients for efficient inference. Now:

- a. Generate the model's prediction from the inputs, capping the output length at 50 tokens (max_length=50), then store the result in outputs.
- b. Decode the generated output into readable text using tokenizer.decode(), skipping special tokens and stripping whitespace, then store it in predicted_text.

#### 2. Compare Predictions with Correct Answers:

We have defined a function compare_and_collect() to compare the model’s prediction with the correct answer and return all the results. Now:

- a. Extract the correct answer from the row’s choices using the row’s answer index and store it in correct_answer.
- b. Compare the predicted text with the correct answer using .lower() for case-insensitive comparison and store the result in is_correct.
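The comparison step can be tried on its own before wiring it into the loop. This sketch uses the choices from the global_facts sample question; the helper name `is_prediction_correct` and the extra .strip() call are assumptions added for illustration.

```python
def is_prediction_correct(choices, answer_index, predicted_text):
    """Case-insensitive exact comparison against the correct choice."""
    correct_answer = choices[answer_index]
    is_correct = predicted_text.strip().lower() == correct_answer.strip().lower()
    return is_correct, correct_answer


choices = ["4 billion", "3 billion", "2 billion", "1 billion"]
answer_index = 2

# Same text, different casing and spacing: counted as correct.
print(is_prediction_correct(choices, answer_index, "2 Billion "))
# A prediction that strings two choices together: counted as wrong.
print(is_prediction_correct(choices, answer_index, "3 billion 2 billion"))
```

Exact string comparison is strict: a prediction that contains the right choice plus extra text is scored as incorrect, which tends to push measured accuracy down for generative models.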

#### 3. Process All Sample Questions:

We’ve set up the loop to process all the sample questions and collect the results in a list to store them in the DataFrame results_df. Now:

- a. Call generate_prediction() with the row to get the prediction and store it in pred.
- b. Call compare_and_collect with the row and prediction to get the results and store them in res.

#### 4. Evaluate Model Performance:

Now that results_df contains the model’s predictions, let’s assess the overall performance of the model using accuracy and F1 score:

- a. Calculate accuracy using accuracy_score() by comparing the correct column against a reference list of True values with the same length.
- b. Calculate F1 score using f1_score() with the same comparison, using average set to binary.

#### 5. Analyze Performance by Subject:

Now that we’ve evaluated overall model performance, let’s analyze how well it performed across different subjects.

Group results_df by subject and compute the mean of the correct column to determine the percentage of correct predictions per subject.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

dataset=load_dataset("cais/mmlu", "all")
dataset

# Function to convert dataset to DataFrame
def dataset_to_df(data_split):
    data_list = []
    for example in data_split:
        data_list.append({
            'subject': example['subject'],
            'question': example['question'],
            'choices': example['choices'],
            'answer_index': example['answer']
        })
    return pd.DataFrame(data_list)

# Convert all four splits to DataFrames
test_df = dataset_to_df(dataset["test"])
validation_df = dataset_to_df(dataset["validation"])
dev_df = dataset_to_df(dataset["dev"])
auxiliary_train_df = dataset_to_df(dataset["auxiliary_train"])

In [None]:
def extract_diverse_sample(dataframe, sample_size=10):
    np.random.seed(10)  # Set seed inside function to ensure reproducibility

    unique_subjects = dataframe['subject'].unique()
    selected_subjects = np.random.choice(unique_subjects, size=min(len(unique_subjects), sample_size), replace=False)

    # Sample one question per selected subject
    sample_df = dataframe[dataframe['subject'].isin(selected_subjects)].groupby('subject').sample(n=1, random_state=42)

    # If fewer than sample_size rows were drawn, randomly sample more to fill the rest
    if len(sample_df) < sample_size:
        remaining_samples = dataframe[~dataframe.index.isin(sample_df.index)].sample(
            n=sample_size - len(sample_df), random_state=42
        )
        sample_df = pd.concat([sample_df, remaining_samples])

    return sample_df.reset_index(drop=True)

# Extract sample data
sample_data = extract_diverse_sample(validation_df, sample_size=10)

# Display sample
sample_data

Unnamed: 0,subject,question,choices,answer_index
0,astronomy,You cool a blackbody to half its original temp...,[Power emitted is 1/16 times as high; peak emi...,3
1,business_ethics,"______, ________ and _______ are all factors w...","[Resource intensity, Culture clash, Accountabi...",0
2,global_facts,There are two billion children in the world to...,"[4 billion, 3 billion, 2 billion, 1 billion]",2
3,high_school_european_history,This question refers to the following informat...,[women were better suited than men for the tas...,0
4,high_school_government_and_politics,Which of the following processes best exemplif...,"[Amending the United States Constitution, Impo...",0
5,high_school_microeconomics,When marginal cost equals price in a perfectly...,[There is a socially optimal or efficient outp...,2
6,high_school_physics,A light bulb rated at 100 W is twice as bright...,"[The bulbs have the same brightness., The 100-...",2
7,high_school_us_history,This question refers to the following informat...,"[The granting of political rights to women, Th...",2
8,human_aging,One difference between the friendships of olde...,"[Have far fewer friends, Have more cross gende...",1
9,nutrition,Which statement best describes the process of ...,"[All of the options given are correct, Vitamin...",0


In [None]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Load a smaller model with reasoning capabilities
print("Loading smaller model with reasoning capabilities...")
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Use CPU
device = "cpu"
model.to(device)

print("Model loaded successfully")


Loading smaller model with reasoning capabilities...


Model loaded successfully


In [None]:
### Generate Predictions for Sample Questions
def generate_prediction(row):
    input_text = f"Question: {row['question']}\nChoices:\n" + "\n".join(row['choices'])
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        ## YOUR SOLUTION HERE ##
        outputs = model.generate(**inputs, max_length=50)
    predicted_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    return predicted_text

In [None]:
### Compare Predictions with Correct Answers
def compare_and_collect(row, predicted_text):
    ## YOUR SOLUTION HERE ##
    correct_answer = row['choices'][row['answer_index']]
    is_correct = predicted_text.lower() == correct_answer.lower()
    return {
        'subject': row['subject'],
        'question': row['question'],
        'choices': row['choices'],
        'actual_answer': correct_answer,
        'predicted_answer': predicted_text,
        'correct': is_correct
    }

In [None]:
### Process All Sample Questions
results = []
for _, row in sample_data.iterrows():
    pred = generate_prediction(row)
    res = compare_and_collect(row, pred)
    results.append(res)

results_df = pd.DataFrame(results)
results_df

Unnamed: 0,subject,question,choices,actual_answer,predicted_answer,correct
0,astronomy,You cool a blackbody to half its original temp...,[Power emitted is 1/16 times as high; peak emi...,Power emitted is 1/16 times as high; peak emis...,Power emitted is 1/16 times as high; peak emis...,False
1,business_ethics,"______, ________ and _______ are all factors w...","[Resource intensity, Culture clash, Accountabi...","Resource intensity, Culture clash, Accountability",Coordinative fluidity,False
2,global_facts,There are two billion children in the world to...,"[4 billion, 3 billion, 2 billion, 1 billion]",2 billion,2 billion,True
3,high_school_european_history,This question refers to the following informat...,[women were better suited than men for the tas...,women were better suited than men for the task...,women were better suited than men to run schools,False
4,high_school_government_and_politics,Which of the following processes best exemplif...,"[Amending the United States Constitution, Impo...",Amending the United States Constitution,Amending the United States Constitution Imposi...,False
5,high_school_microeconomics,When marginal cost equals price in a perfectly...,[There is a socially optimal or efficient outp...,It is a sign of high concentration among sellers.,a socially optimal or efficient output and price,False
6,high_school_physics,A light bulb rated at 100 W is twice as bright...,"[The bulbs have the same brightness., The 100-...",The 50-W bulb is twice as bright.,The 100-W bulb is twice as bright. The 50-W bu...,False
7,high_school_us_history,This question refers to the following informat...,"[The granting of political rights to women, Th...",The moral justification of slavery,The moral justification of slavery,True
8,human_aging,One difference between the friendships of olde...,"[Have far fewer friends, Have more cross gende...",Have more cross gender friendships,Have far fewer friends Have more cross gender ...,False
9,nutrition,Which statement best describes the process of ...,"[All of the options given are correct, Vitamin...",All of the options given are correct,Oxidised vitamin C (dehydroascorbic acid) is a...,False


In [None]:
### Evaluate Model Performance
from sklearn.metrics import accuracy_score, f1_score


accuracy = accuracy_score(results_df['correct'], [True] * len(results_df['correct']))
f1 = f1_score(results_df['correct'], [True] * len(results_df['correct']), average='binary')

print(f"\nModel Performance:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"F1 Score: {f1 * 100:.2f}%")


Model Performance:
Accuracy: 20.00%
F1 Score: 33.33%


### Model Performance Analysis
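Accuracy: 20.00% - The model answered 2 of the 10 sampled questions correctly.

F1 Score: 33.33% - This figure adds no information beyond accuracy. Because the correct column is scored against a list of all True values, precision equals accuracy (0.20) and recall is 1.0, so F1 = 2 × 0.20 × 1.0 / (0.20 + 1.0) ≈ 0.33.

Interpretation:

Low accuracy (20%) shows that the model rarely reproduces the correct choice verbatim. The comparison is an exact string match, so near misses count as errors: in the high_school_government_and_politics row, the prediction begins with the correct answer but appears to run on into the next choice.
F1 being higher than accuracy is an artifact of the all-True reference list, not a sign of false positives or false negatives in the model's answers. With only 10 questions, both numbers are rough indicators at best.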
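The relationship between the two reported numbers can be checked by hand. This sketch recomputes them in plain Python from the correct column shown in the results table, using the same argument order as the cell above (the correct column first, the all-True list second); `precision_recall_f1` is an illustrative helper, not the scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 with True as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# The 'correct' column from the results table above.
correct = [False, False, True, False, False, False, False, True, False, False]
all_true = [True] * len(correct)

accuracy = sum(c == p for c, p in zip(correct, all_true)) / len(correct)
precision, recall, f1 = precision_recall_f1(correct, all_true)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

With an all-True second argument, recall is always 1.0 and precision equals accuracy, so F1 reduces to 2a / (1 + a) for accuracy a. For a = 0.20 that gives 0.33, which is why F1 comes out higher than accuracy here.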

In [None]:
### Analyze Performance by Subject

subject_performance = results_df.groupby('subject')['correct'].mean() * 100

# Show performance by subject area (percentage of correct predictions per subject)
print("\nPerformance by subject:")
subject_performance


Performance by subject:


subject,correct
astronomy,0.0
business_ethics,0.0
global_facts,100.0
high_school_european_history,0.0
high_school_government_and_politics,0.0
high_school_microeconomics,0.0
high_school_physics,0.0
high_school_us_history,100.0
human_aging,0.0
nutrition,0.0


### Performance Analysis by Subject
- High Accuracy (100% Correct Predictions):
The model answered the single sampled question correctly in global_facts and high_school_us_history.

- Low Accuracy (0% Correct Predictions):
In the other eight subjects, including astronomy, business_ethics, and high_school_physics, the sampled question was answered incorrectly.

- Performance Variation:
Because each subject is represented by one question, every per-subject score is either 0% or 100%. These figures show how the breakdown works but say little about real domain strengths; a larger sample per subject is needed for that.

### LLM Leaderboards
We’ve successfully evaluated LLMs on three different language tasks using benchmark datasets — sentiment analysis, question answering, and general knowledge. We will conclude this lesson with a brief overview of LLM leaderboards and a summary of the current state of LLM evaluation.

The rapid advancement of Large Language Models (LLMs) has spawned a large and sophisticated ecosystem of benchmarks, creating a need for the AI/ML community to keep track of the most up-to-date performance of LLMs collectively. LLM leaderboards are platforms that track and rank LLMs based on their performance and collectively serve as the field’s measuring stick.

LLM leaderboards typically contain key components such as model information, benchmark scores, overall ranking, submission date and methodological details. Several leaderboards have become central to tracking LLM progress:

- Hugging Face Open LLM Leaderboard: focuses on open-source models across multiple benchmark tasks
- Stanford CRFM HELM Leaderboard: provides detailed breakdowns of performance across multiple dimensions
- LMSYS Chatbot Arena: ranks models based on human preference through head-to-head comparisons
- Papers With Code Leaderboards: tracks state-of-the-art performance on specific tasks with links to implementing papers

#### Why Evaluating LLMs is still a WIP
Leaderboard integrity depends on standardized submission processes such as model registration, standardized evaluation, verification, publication and optional peer review. These processes are still very much a work in progress, and a lot depends on consensus within the community of AI researchers. A persistent challenge in the benchmark-leaderboard ecosystem is the “benchmark chase”: models are often explicitly optimized for benchmark performance, which can lead to overfitting to specific benchmark formats and misalignment with real-world performance.

Current benchmarks and leaderboards also face representation issues:

- Predominance of English-language evaluation
- Cultural biases in knowledge and reasoning tasks
- Limited coverage of specialized domains
- Challenges in measuring emergent capabilities not anticipated during benchmark design
- Ongoing debates over evaluation methodology

LLM benchmarks and leaderboards form a symbiotic ecosystem that drives progress in the field. Benchmarks provide standardized measurements, while leaderboards track improvements over time and across models. Together, they create competitive pressure that accelerates innovation while establishing common standards.

However, this ecosystem must continually evolve to remain relevant. As models become more capable, benchmarks must become more sophisticated. As societal implications of LLMs grow, evaluation frameworks must expand to incorporate broader considerations of responsible AI. The most useful benchmarks and leaderboards don’t just track capabilities — they shape the direction of research by highlighting what we’ve achieved and what remains to be solved. They serve not just as scoreboards but as roadmaps for the future of artificial intelligence.

In [None]:
!pip install datasets==2.11.0





In [None]:
# Uninstall the conflicting fsspec and gcsfs, then install compatible versions explicitly
!pip uninstall -y fsspec gcsfs
!pip install fsspec==2025.3.0 gcsfs==2025.3.0


In [None]:
pip install -U datasets fsspec huggingface-hub

Collecting datasets
  Downloading datasets-4.4.2-py3-none-any.whl.metadata (19 kB)
Collecting fsspec
  Downloading fsspec-2026.1.0-py3-none-any.whl.metadata (10 kB)
Collecting huggingface-hub
  Downloading huggingface_hub-1.3.1-py3-none-any.whl.metadata (13 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecting fsspec
  Downloading fsspec-2025.10.0-py3-none-any.whl.metadata (10 kB)
Downloading datasets-4.4.2-py3-none-any.whl (512 kB)
Downloading fsspec-2025.10.0-py3-none-any.whl (200 kB)
Downloading huggingface_hub-1.3.1-py3-none-any.whl (533 kB)

In [None]:
pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Using cached fsspec-2025.10.0-py3-none-any.whl.metadata (10 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Using cached fsspec-2025.10.0-py3-none-any.whl (200 kB)
Installing collected packages: fsspec, evaluate
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2026.1.0
    Uninstalling fsspec-2026.1.0:
      Successfully uninstalled fsspec-2026.1.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2025.10.0 which is incompatible.
Successfully installed evaluate-0.4.6 fsspec-2025.1

In [None]:
pip install "huggingface-hub>=0.34.0,<1.0" "transformers" -U

Collecting huggingface-hub<1.0,>=0.34.0
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Downloading huggingface_hub-0.36.0-py3-none-any.whl (566 kB)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface_hub 1.3.1
    Uninstalling huggingface_hub-1.3.1:
      Successfully uninstalled huggingface_hub-1.3.1
Successfully installed huggingface-hub-0.36.0
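
Because these install cells upgrade and downgrade the same packages several times, it can help to check what is actually installed before importing anything. The following is a minimal sketch using only the standard library; the helper names `installed_versions` and `pin_mismatches` are hypothetical, and the only pin checked is the `fsspec==2025.3.0` requirement that pip reported for gcsfs 2025.3.0 earlier in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def pin_mismatches(found, pins):
    """Return {package: (installed, required)} for every exact pin not satisfied."""
    return {
        name: (found.get(name), required)
        for name, required in pins.items()
        if found.get(name) != required
    }


# Packages touched by the install cells in this notebook.
found = installed_versions(
    ["datasets", "evaluate", "fsspec", "gcsfs", "huggingface-hub", "transformers"]
)
for name, ver in found.items():
    print(f"{name}: {ver or 'not installed'}")

# gcsfs 2025.3.0 pins fsspec==2025.3.0 (taken from the pip error message).
if found.get("gcsfs") == "2025.3.0":
    print("Unsatisfied pins:", pin_mismatches(found, {"fsspec": "2025.3.0"}))
```

In Colab, restarting the runtime after changing versions is usually needed before the new versions are picked up by already-imported modules.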


In [None]:
pip install transformers --force-reinstall

Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting filelock (from transformers)
  Downloading filelock-3.20.3-py3-none-any.whl.metadata (2.1 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting numpy>=1.17 (from transformers)
  Downloading numpy-2.4.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.6 kB)
Collecting packaging>=20.0 (from transformers)
  Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB)
Collecting pyyaml>=5.1 (from transformers)
  Downloading pyyaml-6.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (2.4 kB)
Coll