<center><h2><strong><font color="blue">WFH 2024 idBigData - LLM Evaluation</font></strong></h2></center>

<img alt="" src="https://github.com/taudataanalytics/WFH-idBigData-2024/blob/main/images/covers/cover_taudata_uin.jpg?raw=1"/>

# Remember to switch the Runtime to GPU in Google Colab

<center><h2><strong><font color="blue">Example 01: Accuracy</font></strong></h2></center>

* This example evaluates a small public LLM, DistilBERT from Hugging Face’s Transformers library, on a text classification task.
* The data is the IMDb movie review dataset, used here for binary sentiment classification (negative vs. positive).

In [8]:
import warnings; warnings.simplefilter('ignore')
from tqdm import tqdm

try:
    import google.colab; IN_COLAB = True
    print("Installing the required modules")
    #!pip install transformers datasets scikit-learn --q
    !pip install rouge-score --q
    print("preparing directories and assets")
    #!mkdir data images output models
    #!wget https://raw.githubusercontent.com/taudata...
except ImportError:  # google.colab is missing, so we are not running on Colab
    IN_COLAB = False
    print("Running the code locally: make sure your Python module versions match the Colab environment and that all data/assets are downloaded")

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score

# Set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Installing the required modules
preparing directories and assets
Using device: cuda


# Loading Model & Dataset

In [4]:
# Load the pre-trained model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

# Load IMDb dataset and use a subset for quick evaluation
dataset = load_dataset("imdb", split="test[:100]")  # Using only 100 samples for demonstration
"Done"

'Done'

# LLM Accuracy on Sentiment Analysis

In [5]:
# Initialize a sentiment analysis pipeline with the loaded model and tokenizer
# (a convenience wrapper; the loop below calls the model directly and does not use it)
sentiment_analyzer = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)

# Prepare predictions and labels
pred_labels = []
true_labels = []

# Evaluate each sample in the dataset
for example in dataset:
    # Tokenize text and truncate to model's max length
    inputs = tokenizer(example["text"], truncation=True, padding="max_length", max_length=512, return_tensors="pt")
    inputs = {key: val.to(device) for key, val in inputs.items()}  # Move tensors to GPU

    # Perform inference
    with torch.no_grad():
        result = model(**inputs)

    # Get predicted label (0 for negative, 1 for positive)
    pred_label = torch.argmax(result.logits, dim=1).item()
    true_label = example["label"]  # IMDb dataset uses 0 for negative, 1 for positive

    # Append predictions and true labels
    pred_labels.append(pred_label)
    true_labels.append(true_label)

# Calculate accuracy
accuracy = accuracy_score(true_labels, pred_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 89.00%
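
The accuracy above is measured on only 100 reviews, so it carries noticeable sampling uncertainty. The block below is a minimal pure-Python sketch, not part of the original notebook, of what `accuracy_score` computes, together with a Wilson score interval for the result. The helper names `accuracy` and `wilson_interval` and the toy labels are hypothetical and used for illustration only.

```python
import math

def accuracy(true_labels, pred_labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)

def wilson_interval(acc, n, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy measured on n samples."""
    denom = 1 + z**2 / n
    center = (acc + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(acc * (1 - acc) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Toy labels (hypothetical): 3 of the 4 predictions are correct
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75

# Uncertainty of the 89% accuracy reported above for 100 samples
low, high = wilson_interval(0.89, 100)
print(f"89% on 100 samples: roughly {low:.2%} to {high:.2%}")
```

Under these assumptions the interval spans roughly 81% to 94%, which suggests that a larger test subset would be needed before comparing models that differ by only a few points.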


<center><h2><strong><font color="blue">Example 02: BLEU / ROUGE</font></strong></h2></center>

* BLEU and ROUGE are n-gram overlap scores commonly used to evaluate tasks such as machine translation, summarization, and text generation.
* In this example, we’ll evaluate a model’s text summarization capability using the CNN/DailyMail dataset.
* We’ll use Hugging Face’s transformers library to load a pre-trained summarization model and the datasets library to load the dataset.
* The evaluation metrics BLEU and ROUGE will be calculated using nltk and rouge_score.
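
To make the metrics less of a black box, here is a simplified pure-Python sketch of ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence) F-measures. It assumes lowercased whitespace tokenization with no stemming, so its numbers will not match `rouge_score` exactly. The function names are hypothetical.

```python
from collections import Counter

def f_measure(overlap, ref_len, cand_len):
    """Harmonic mean of precision (overlap/cand_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge1_f1(reference, candidate):
    """ROUGE-1: clipped count of unigrams shared by reference and candidate."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return f_measure(overlap, len(ref), len(cand))

def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rougeL_f1(reference, candidate):
    """ROUGE-L: overlap measured by the longest in-order common word sequence."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return f_measure(lcs_length(ref, cand), len(ref), len(cand))

reference = "the cat sat on the mat"
candidate = "on the mat the cat sat"  # same words, different order
print(rouge1_f1(reference, candidate))  # 1.0
print(rougeL_f1(reference, candidate))  # 0.5
```

The toy pair shows the difference between the two variants: ROUGE-1 ignores word order, while ROUGE-L rewards only words that appear in the same order as in the reference.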

In [9]:
from transformers import pipeline
from datasets import load_dataset
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import nltk
nltk.download("punkt")  # newer NLTK releases may also need nltk.download("punkt_tab") for word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Loading Model & Dataset

In [10]:
# Load pre-trained summarization model; pass `device` explicitly, otherwise the pipeline stays on the CPU
# (torch was imported in the first cell)
model_name = "facebook/bart-large-cnn"
summarizer = pipeline("summarization", model=model_name, device=0 if torch.cuda.is_available() else -1)

# Load a sample dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test[:10]")  # Using a small subset for quick evaluation



# Evaluating Summaries with the BLEU / ROUGE Metrics

In [11]:
# Initialize ROUGE scorer
rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Prepare evaluation metrics
bleu_scores = []
rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}

# Evaluate each example
for sample in tqdm(dataset):
    # Generate summary with the model (truncate articles longer than the model's input limit)
    generated_summary = summarizer(sample["article"], max_length=50, min_length=25, do_sample=False, truncation=True)[0]["summary_text"]

    # Reference summary from the dataset
    reference_summary = sample["highlights"]

    # Compute BLEU score
    reference_tokens = nltk.word_tokenize(reference_summary)
    generated_tokens = nltk.word_tokenize(generated_summary)
    bleu = sentence_bleu([reference_tokens], generated_tokens)
    bleu_scores.append(bleu)

    # Compute ROUGE scores
    rouge_score = rouge.score(reference_summary, generated_summary)
    for key in rouge_scores.keys():
        rouge_scores[key].append(rouge_score[key].fmeasure)

# Calculate average scores
average_bleu = sum(bleu_scores) / len(bleu_scores)
average_rouge = {key: sum(values) / len(values) for key, values in rouge_scores.items()}

# Print results
print(f"Average BLEU score: {average_bleu:.4f}")
print("Average ROUGE scores:")
for key, score in average_rouge.items():
    print(f"  {key}: {score:.4f}")

Average BLEU score: 0.1455
Average ROUGE scores:
  rouge1: 0.4265
  rouge2: 0.2140
  rougeL: 0.3293
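
The BLEU average is much lower than the ROUGE values partly because sentence-level BLEU multiplies 1-gram to 4-gram precisions, so a short summary with few matching 4-grams is penalized heavily. The sketch below is a simplified single-reference BLEU written for illustration; it is not NLTK's implementation, and the names `modified_precision` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def simple_bleu(reference, candidate, max_n=4):
    """Brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # one empty n-gram order zeroes the whole geometric mean
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(simple_bleu(reference, candidate))            # 0.0, no 4-gram matches
print(simple_bleu(reference, candidate, max_n=2))   # geometric mean of 5/6 and 3/5
```

One changed word is enough to remove every matching 4-gram in this toy pair. NLTK offers smoothing through `SmoothingFunction` (passed as the `smoothing_function` argument of `sentence_bleu`) to soften this effect; using it would change the average reported above, so scores are only comparable when the same setting is used.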
