<a href="https://colab.research.google.com/github/ranga-godhandaraman/LLM-Benchmark/blob/main/MMLU_SQUAD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
from transformers import AlbertTokenizer, AlbertModel, DistilBertTokenizer, DistilBertModel
import numpy as np

In [None]:
# Load pre-trained ALBERT model and tokenizer
albert_model_name = 'albert-base-v2'
albert_tokenizer = AlbertTokenizer.from_pretrained(albert_model_name)
albert_model = AlbertModel.from_pretrained(albert_model_name)

In [None]:
# Load pre-trained DistilBERT model and tokenizer
distilbert_model_name = 'distilbert-base-uncased'
distilbert_tokenizer = DistilBertTokenizer.from_pretrained(distilbert_model_name)
distilbert_model = DistilBertModel.from_pretrained(distilbert_model_name)

In [None]:
# Sample input text (adjacent string literals are concatenated into one string)
input_text = (
    "Diabetes management traditionally relies on standardized approaches, neglecting individual needs. Data science offers a transformative opportunity to personalize care. This article explores the impact of data science on diabetes care, focusing on key areas of application and the ethical considerations involved. I’ve combined insights from relevant research, medical literature, and current technology advancements in data science. "
    "Data science holds significant potential to revolutionize diabetes care by fostering personalized and effective treatment strategies. However, ensuring patient privacy and ethical data-driven practices remains crucial. This integrated approach holds promise for improved outcomes and quality of life for individuals living with diabetes. "
    "Instead of imagining a fire, think of predicting a storm. Data science analyzes your health data (blood sugar, family history, etc.) to identify early signs of trouble. This allows doctors to recommend preventive measures like healthy eating, exercise, or even medication, potentially delaying or even preventing diabetes altogether. "
    "Think of this like creating a unique recipe just for your taste buds. Data science considers your genetic makeup, daily routines, and even your living environment to craft a personalized treatment plan. This could involve specific food choices, exercise programs, or medication dosages that are most effective for you. "
    "Imagine having a fitness tracker for your whole body. Wearable devices and sensors continuously collect data about your blood sugar, activity levels, and sleep patterns. Doctors can then analyze this data remotely, allowing them to adjust your treatment plan in real time and prevent potential complications before they arise. "
    "Think of this as unlocking a secret code. Data science analyzes massive datasets of patient responses to different medication combinations. This allows doctors to identify the unique mix that will be most effective for you, with minimal side effects. This personalized approach can significantly improve your treatment outcomes and overall well-being. "
    "Data security is like building a fortress around your health information. Robust encryption methods and strict data protection regulations ensure that your medical data stays safe and confidential. You can be confident that your information is only used for your healthcare and is never shared without your consent."
)

In [None]:
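# Tokenize input text (truncation=True keeps the input within each model's 512-token limit)
albert_inputs = albert_tokenizer(input_text, return_tensors="pt", truncation=True)
distilbert_inputs = distilbert_tokenizer(input_text, return_tensors="pt", truncation=True)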

In [None]:
# ALBERT Forward pass
with torch.no_grad():
    albert_outputs = albert_model(**albert_inputs)

In [None]:
# DistilBERT Forward pass
with torch.no_grad():
    distilbert_outputs = distilbert_model(**distilbert_inputs)

In [None]:
# Get the final-layer hidden states, shape (batch, seq_len, hidden_size)
albert_layer_activations = albert_outputs.last_hidden_state
distilbert_layer_activations = distilbert_outputs.last_hidden_state

In [None]:
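# Calculate layer utilization: the fraction of batch items with a non-zero activation
# at each (token, hidden unit) position. With a single input this is a 0/1 indicator
# of shape (seq_len, hidden_size).
albert_layer_utilization = torch.mean((albert_layer_activations != 0).float(), dim=0).numpy()
distilbert_layer_utilization = torch.mean((distilbert_layer_activations != 0).float(), dim=0).numpy()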

In [None]:
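# Calculate the score this notebook calls "MMLU": the maximum utilization over hidden
# units for each token, averaged over tokens. It is an activation statistic, not
# accuracy on the MMLU (Massive Multitask Language Understanding) benchmark.
albert_mmlu = np.mean(np.max(albert_layer_utilization, axis=1))
distilbert_mmlu = np.mean(np.max(distilbert_layer_utilization, axis=1))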

In [None]:
print("ALBERT MMLU:", albert_mmlu)
print("DistilBERT MMLU:", distilbert_mmlu)

ALBERT MMLU: 1.0
DistilBERT MMLU: 1.0
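
Both scores are exactly 1.0 because the `!= 0` test is true for almost every element of a dense floating-point tensor, so the statistic cannot separate the two models. The sketch below illustrates this on random data and contrasts it with a magnitude threshold. The array is a stand-in for `last_hidden_state`, not real model output, and the helper name `active_fraction` and the threshold `eps` are hypothetical choices made for illustration only.

```python
import numpy as np

def active_fraction(activations, eps=0.0):
    """Return the fraction of entries whose magnitude is strictly greater than eps."""
    return float(np.mean(np.abs(activations) > eps))

rng = np.random.default_rng(0)
# Stand-in for a (batch, seq_len, hidden_size) hidden-state tensor (assumed shape,
# random values); real hidden states would come from the models above.
fake_hidden = rng.standard_normal((1, 16, 768)).astype(np.float32)

# eps=0.0 is equivalent to the notebook's `!= 0` test.
nonzero_fraction = active_fraction(fake_hidden)
# A non-zero threshold (arbitrary value here) gives a score that can vary.
thresholded_fraction = active_fraction(fake_hidden, eps=1.0)

print("non-zero fraction:", nonzero_fraction)
print("fraction with |x| > 1.0:", thresholded_fraction)
```

Any threshold is a modelling choice; the value 1.0 above is arbitrary and only shows that a non-trivial cutoff produces a score strictly between 0 and 1.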


## With the SQuAD dataset

In [None]:
!pip install --upgrade transformers datasets



In [None]:
from datasets import load_dataset
# Load SQuAD dataset
squad_dataset = load_dataset("squad")

In [None]:
# Take the first 10 passages and questions from the SQuAD training split
# (slicing the split first avoids materializing the full columns)
sample = squad_dataset["train"][:10]
passages = sample["context"]
questions = sample["question"]

In [None]:
# The tokenizers and models were already loaded above, so they are not re-initialized here.
# # Initialize ALBERT tokenizer and model
# albert_tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
# albert_model = AlbertModel.from_pretrained("albert-base-v2")

# # Initialize DistilBERT tokenizer and model
# distilbert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# distilbert_model = DistilBertModel.from_pretrained("distilbert-base-uncased")

In [None]:
from transformers import DataCollatorWithPadding

In [None]:
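# Function to calculate the notebook's "MMLU" score for a given model and tokenizer
def calculate_mmlu(model, tokenizer, passages):
    """Return the max per-unit non-zero rate for each token position, averaged over positions."""
    # Pad to the longest passage and truncate to the model's maximum length
    inputs = tokenizer(passages, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    activations = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
    # Fraction of passages with a non-zero activation at each (token, unit) position
    layer_utilization = torch.mean((activations != 0).float(), dim=0)
    # Max over hidden units per token position, then mean over positions
    mmlu = torch.mean(torch.max(layer_utilization, dim=1).values).item()
    return mmlu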
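
Because `padding=True` pads every passage to the longest one in the batch, the statistic above also counts padded positions. A minimal sketch of a padding-aware variant follows, assuming an attention mask of 0/1 values shaped `(batch, seq_len)` like the one the tokenizer returns. The helper name `masked_active_fraction`, the threshold `eps`, and the toy arrays are hypothetical and are not part of the original notebook.

```python
import numpy as np

def masked_active_fraction(activations, attention_mask, eps=0.0):
    """Fraction of entries with |x| > eps, counting only non-padding tokens.

    activations: array of shape (batch, seq_len, hidden_size)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    active = (np.abs(activations) > eps).astype(np.float64)
    mask = attention_mask[..., None].astype(np.float64)  # (batch, seq_len, 1)
    # Number of real (non-padding) entries across the whole batch
    total = mask.sum() * activations.shape[-1]
    return float((active * mask).sum() / total)

# Toy batch: 2 passages, 2 token positions, 2 hidden units (made-up values).
toy_activations = np.array([[[2.0, 0.1], [0.5, 3.0]],
                            [[1.5, 1.5], [9.0, 9.0]]])
# The second passage's last position is padding and should be ignored.
toy_mask = np.array([[1, 1],
                     [1, 0]])

masked = masked_active_fraction(toy_activations, toy_mask, eps=1.0)
unmasked = float(np.mean(np.abs(toy_activations) > 1.0))
print("masked:", masked, "unmasked:", unmasked)
```

In the toy batch the padded position holds large values, so ignoring it changes the result; with real model output, the same masking would use `inputs["attention_mask"]` converted to a NumPy array.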

In [None]:
# Calculate MMLU for ALBERT
albert_mmlu = calculate_mmlu(albert_model, albert_tokenizer, passages)
print("ALBERT MMLU:", albert_mmlu)

ALBERT MMLU: 1.0


In [None]:
# Calculate MMLU for DistilBERT
distilbert_mmlu = calculate_mmlu(distilbert_model, distilbert_tokenizer, passages)
print("DistilBERT MMLU:", distilbert_mmlu)

DistilBERT MMLU: 1.0
