<center><h2><strong><font color="blue">WFH 2024 idBigData - LLM Evaluation</font></strong></h2></center>

<img alt="" src="images/covers/cover_taudata_uin.jpg"/>

# Remember to switch the Runtime to GPU in Google Colab

<center><h2><strong><font color="blue">Example 01: Accuracy</font></strong></h2></center>

* We evaluate a small public model, DistilBERT from Hugging Face’s Transformers library (a checkpoint fine-tuned on SST-2), on a text classification task.
* The evaluation data is the IMDb movie review dataset for binary sentiment classification.
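As a point of reference for the metric used in this example, accuracy is the fraction of predictions that match the true labels. The following is a minimal sketch in plain Python. The helper name `simple_accuracy` and the toy labels are illustrative assumptions, not part of the notebook. On the same inputs it should agree with `sklearn.metrics.accuracy_score`.

```python
def simple_accuracy(true_labels, pred_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)

# Toy labels (hypothetical): 0 = negative, 1 = positive
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
print(f"Accuracy: {simple_accuracy(toy_true, toy_pred) * 100:.2f}%")
```

Accuracy alone can be misleading when the evaluation sample is imbalanced, so it helps to check the class distribution of the subset being scored.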

In [None]:
import warnings; warnings.simplefilter('ignore') 

try:
    import google.colab; IN_COLAB = True
    print("Installing the required modules")
    # rouge_score and nltk are needed by Example 02
    !pip install -q transformers peft datasets rouge_score nltk
    print("Preparing directories and assets")
    # -p avoids an error when the directories already exist
    !mkdir -p data images output models
    #!wget https://raw.githubusercontent.com/taudata...
except ImportError:
    IN_COLAB = False
    print("Running the code locally. Please make sure all Python module versions match the Colab environment and all data/assets are downloaded.")

In [None]:
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score

seed = 0
np.random.seed(seed)
"Done"

# Loading Model & Dataset

In [None]:
# Load a small pre-trained model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # Fine-tuned on sentiment analysis
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

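# Load 100 shuffled IMDb test reviews for a quick evaluation.
# The IMDb test split is ordered by label, so a plain "test[:100]" slice
# would contain reviews from a single class only.
dataset = load_dataset("imdb", split="test").shuffle(seed=seed).select(range(100))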
"Done"

# LLM Accuracy on Sentiment Analysis

In [None]:
# Initialize a pipeline for sentiment analysis
sentiment_analyzer = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Prepare predictions and labels
pred_labels = []
true_labels = []

for example in dataset:
    # Perform inference; truncate reviews that exceed the model's 512-token input limit
    result = sentiment_analyzer(example["text"], truncation=True)[0]
    pred_label = 1 if result["label"] == "POSITIVE" else 0  # Map to binary label
    true_label = example["label"]  # IMDb dataset uses 0 for negative, 1 for positive

    # Append predictions and true labels
    pred_labels.append(pred_label)
    true_labels.append(true_label)

# Calculate accuracy
accuracy = accuracy_score(true_labels, pred_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")

<center><h2><strong><font color="blue">Example 02: BLEU / ROUGE</font></strong></h2></center>

* BLEU and ROUGE scores measure n-gram overlap between generated text and a reference, and are commonly used for tasks like machine translation, summarization, and text generation.
* In this example, we’ll evaluate a model’s text summarization capability using the CNN/DailyMail dataset.
* We’ll use Hugging Face’s transformers library to load a pre-trained summarization model and the datasets library to load the dataset.
* BLEU is calculated with nltk and ROUGE with rouge_score.
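To make the overlap idea behind both metrics concrete, here is a minimal sketch that counts clipped unigram matches between a reference and a candidate. The precision it returns is the building block of BLEU-1 (full BLEU combines 1- to 4-gram precisions with a brevity penalty), and the recall and F1 correspond to ROUGE-1. The function name `unigram_overlap`, the whitespace tokenization, and the toy sentences are assumptions made for illustration. `nltk` and `rouge_score` apply their own tokenization and stemming, so their numbers will generally differ.

```python
from collections import Counter

def unigram_overlap(reference, candidate):
    """Return (precision, recall, f1) of clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipping)
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy sentences (hypothetical)
reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
p, r, f = unigram_overlap(reference, candidate)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Precision penalizes extra words in the candidate, while recall penalizes missing reference words, which is why BLEU is traditionally precision-oriented and ROUGE recall-oriented.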

In [None]:
from transformers import pipeline
from datasets import load_dataset
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import nltk
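nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # newer NLTK releases need this resource for word_tokenize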

# Loading Model & Dataset

In [None]:
# Load pre-trained summarization model
model_name = "facebook/bart-large-cnn"
summarizer = pipeline("summarization", model=model_name)

# Load a sample dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test[:10]")  # Using a small subset for quick evaluation

# Evaluating Summaries with BLEU / ROUGE

In [None]:
from nltk.translate.bleu_score import SmoothingFunction

# Initialize ROUGE scorer
rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Smooth BLEU so that a missing higher-order n-gram match does not collapse the score to zero
smoothing = SmoothingFunction().method1

# Prepare evaluation metrics
bleu_scores = []
rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}

# Evaluate each example
for sample in dataset:
    # Generate summary with the model; truncate articles longer than the model's input limit
    generated_summary = summarizer(
        sample["article"], max_length=50, min_length=25, do_sample=False, truncation=True
    )[0]["summary_text"]

    # Reference summary from the dataset
    reference_summary = sample["highlights"]

    # Compute sentence-level BLEU on word tokens
    reference_tokens = nltk.word_tokenize(reference_summary)
    generated_tokens = nltk.word_tokenize(generated_summary)
    bleu = sentence_bleu([reference_tokens], generated_tokens, smoothing_function=smoothing)
    bleu_scores.append(bleu)
    
    # Compute ROUGE scores
    rouge_score = rouge.score(reference_summary, generated_summary)
    for key in rouge_scores.keys():
        rouge_scores[key].append(rouge_score[key].fmeasure)

# Calculate average scores
average_bleu = sum(bleu_scores) / len(bleu_scores)
average_rouge = {key: sum(values) / len(values) for key, values in rouge_scores.items()}

# Print results
print(f"Average BLEU score: {average_bleu:.4f}")
print("Average ROUGE scores:")
for key, score in average_rouge.items():
    print(f"  {key}: {score:.4f}")

<center><h2><strong><font color="blue">Example 03: Perplexity/Log Loss</font></strong></h2></center>

* Perplexity and log loss are common metrics for evaluating the quality of language models, especially for tasks like language modeling where the model predicts the next word or token.
* In this example, we’ll use Hugging Face’s transformers library to load a pre-trained GPT-2 model and a small dataset to calculate these metrics.
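The relationship between the two metrics can be sketched without any model: log loss is the average negative log probability the model assigns to each actual next token, and perplexity is the exponential of that average. The helper name `perplexity_from_probs` and the toy probabilities below are illustrative assumptions, not part of the notebook.

```python
import math

def perplexity_from_probs(token_probs):
    """Return (average log loss, perplexity) from per-token probabilities.

    token_probs holds the probability the model assigned to each
    actual next token; every value must be in (0, 1].
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    neg_log_likelihoods = [-math.log(p) for p in token_probs]
    avg_log_loss = sum(neg_log_likelihoods) / len(neg_log_likelihoods)
    return avg_log_loss, math.exp(avg_log_loss)

# A model that spreads probability uniformly over 4 choices at every step
avg_loss, ppl = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
print(f"Average log loss: {avg_loss:.4f}, perplexity: {ppl:.4f}")
```

A perplexity of about 4 here reflects that the toy model is as uncertain as a uniform choice among four tokens. Lower values mean the model concentrates more probability on the correct token.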

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from datasets import load_dataset
import math

# Loading Model & Dataset

In [None]:
# Load a pre-trained language model
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Load a small text dataset for evaluation
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test[:5%]")  # Using a small subset for demo

# Evaluating Perplexity and Log Loss

In [None]:
# Put the model in evaluation mode
model.eval()

# Initialize variables to track log loss and token count
total_log_loss = 0
total_tokens = 0

# Calculate log loss and perplexity
for sample in dataset:
    text = sample["text"].strip()
    if not text:
        continue  # WikiText contains blank lines; skip them

    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    n_tokens = inputs["input_ids"].size(1)
    if n_tokens < 2:
        continue  # At least two tokens are needed to predict a next token

    # Use the inputs as labels; the model shifts them internally for next-token prediction
    inputs["labels"] = inputs["input_ids"].clone()

    # Forward pass with no gradient calculation
    with torch.no_grad():
        outputs = model(**inputs)
        log_loss = outputs.loss  # Mean cross-entropy over the n_tokens - 1 predicted tokens

    # Accumulate the summed log loss and the number of predicted tokens
    n_predicted = n_tokens - 1
    total_log_loss += log_loss.item() * n_predicted
    total_tokens += n_predicted

# Calculate average log loss
average_log_loss = total_log_loss / total_tokens

# Calculate perplexity
perplexity = math.exp(average_log_loss)

# Print the results
print(f"Average Log Loss: {average_log_loss:.4f}")
print(f"Perplexity: {perplexity:.4f}")