# CSC420M Assignment 5: Finetuning Large Language Models
Enrique Lejano

You will finetune three models for Named Entity Recognition (NER) on a subset of the TLUnified-NER dataset, identifying PERSON, ORGANIZATION, and LOCATION entities in Tagalog texts. The three models to be finetuned are the following:

- A Tagalog-pretrained model from jcblaise (local, e.g., jcblaise/distilbert-tagalog-base-uncased or jcblaise/roberta-tagalog-base)
- SEA-LION (regional, e.g., aisingapore/sealion7b-instruct)
- An open-source LLM, such as Gemma (multilingual, e.g., google/gemma-2b).

## Environment Setup

Install Dependencies

In [1]:
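# Install the libraries imported in this notebook (evaluate is needed for evaluate.load("seqeval"))
%pip install --upgrade transformers datasets evaluate seqeval accelerate python-dotenv pandas --quiet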

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Only needed on my Windows machine: installs the CUDA 12.8 build of PyTorch.
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Looking in indexes: https://download.pytorch.org/whl/cu128
Note: you may need to restart the kernel to use updated packages.


Use CUDA if a GPU is available; otherwise fall back to MPS (Apple Silicon) or the CPU

In [1]:
import torch
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

In [2]:
from dotenv import load_dotenv

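# Load environment variables (e.g. the Hugging Face token) from .env so that no interactive login is needed.
# load_dotenv() returns False when it did not load any variables, so check that the .env file is found.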
load_dotenv()

False

Select the three model checkpoints to use:
1. RoBERTa Tagalog
2. Llama SEA-LION
3. DistilBERT Multilingual Cased

In [3]:
roberta_tl_checkpoint = 'jcblaise/roberta-tagalog-base'
sealion_checkpoint = 'aisingapore/Llama-SEA-LION-v3.5-8B-R'
distilbert_checkpoint = 'distilbert-base-multilingual-cased'

## Load TLUnified-NER Dataset

In [4]:
from pandas import DataFrame
from datasets import load_dataset

ds = load_dataset("ljvmiranda921/tlunified-ner", "default")
df = DataFrame(ds['train'])
df.head()



Unnamed: 0,id,tokens,ner_tags
0,0,"[Sinabi, nito, na, lantad, ang, ginagawang, pa...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, ..."
1,1,"[Kasama, rin, kasama, sa, ban, ang, mga, priba...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,2,"["", Our, peace, advisers, informed, the, Inter...","[0, 0, 0, 0, 0, 0, 3, 4, 4, 0, 0, 0, 3, 4, 4, ..."
3,3,"[Ayon, kay, Villavicencio, ,, kung, susuriin, ...","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 0, 0, ..."
4,4,"[Sa, ulat, ng, dzBB, radio, nitong, Huwebes, ,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, ..."


Classification Label Equivalents: 

`0` : `O` -> Outside any named entity

`1` : `B-PER` -> Beginning of a person name

`2` : `I-PER` -> Inside a person name

`3` : `B-ORG` -> Beginning of an organization

`4` : `I-ORG` -> Inside an organization

`5` : `B-LOC` -> Beginning of a location

`6` : `I-LOC` -> Inside a location
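As a quick illustration of the mapping above, the sketch below decodes the first four tokens of the fourth preview row (`Ayon kay Villavicencio ,`) into BIO labels. The helper name `decode_tags` is made up for this example; it is not part of the dataset or of any library.

```python
# Label names in ID order, matching the table above
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

def decode_tags(tokens, tag_ids):
    """Pair each token with the BIO label name for its integer tag (hypothetical helper)."""
    return [(token, label_names[tag_id]) for token, tag_id in zip(tokens, tag_ids)]

# First four tokens and tags of the fourth row in the preview above
tokens = ["Ayon", "kay", "Villavicencio", ","]
tag_ids = [0, 0, 1, 0]

decoded = decode_tags(tokens, tag_ids)
for token, label in decoded:
    print(f"{token:15s} {label}")
```

Only `Villavicencio` carries an entity tag here (`B-PER`); every other token is `O`.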

## Data Preprocessing

In [5]:
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

Define a shared tokenization and label-alignment function that is reused for all three models.

In [6]:
from transformers import AutoTokenizer
from functools import partial

def tokenize_and_align_labels(examples, tokenizer, max_length=512):
    tokenized_inputs = tokenizer(
        examples["tokens"], 
        truncation=True, 
        is_split_into_words=True,
        padding=True,  # Add padding
        max_length=max_length,
        return_tensors=None  # Don't convert to tensors yet
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens get -100
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a word gets the label
                label_ids.append(label[word_idx])
            else:
                # Subsequent tokens of the same word get -100
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
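To make the alignment rule in `tokenize_and_align_labels` easier to follow, here is a minimal sketch of the same logic applied to a hand-written `word_ids` list, so that no tokenizer download is needed. The function name `align_labels` and the subword split (the third word broken into three pieces) are assumptions for illustration; a real tokenizer may split the words differently.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else (hypothetical helper)."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special tokens and non-first subwords are ignored by the loss
            aligned.append(ignore_index)
        else:
            # First subword of a new word keeps the word-level label
            aligned.append(word_labels[word_idx])
        previous_word_idx = word_idx
    return aligned

# Hypothetical tokenization: <s> w0 w1 w2 w2 w2 w3 </s>
word_ids = [None, 0, 1, 2, 2, 2, 3, None]
word_labels = [0, 0, 1, 0]  # O, O, B-PER, O

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 0, 0, 1, -100, -100, 0, -100]
```

Because `-100` is the index ignored by the cross-entropy loss in PyTorch, each word contributes exactly one position to the loss and to the metrics, regardless of how many subwords it is split into.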

### Tokenization for `RoBERTa Tagalog` Model 

In [7]:
# Tokenize and align labels first
roberta_tl_tokenizer = AutoTokenizer.from_pretrained(roberta_tl_checkpoint, add_prefix_space=True)
roberta_tokenized = partial(tokenize_and_align_labels, tokenizer=roberta_tl_tokenizer)

# Map tokenization function to dataset
roberta_tl_train = ds['train'].map(roberta_tokenized, batched=True, remove_columns=ds['train'].column_names)
roberta_tl_val = ds['validation'].map(roberta_tokenized, batched=True, remove_columns=ds['validation'].column_names)
roberta_tl_test = ds['test'].map(roberta_tokenized, batched=True, remove_columns=ds['test'].column_names)

### Tokenization for `SEA-LION`

In [8]:
sealion_tokenizer = AutoTokenizer.from_pretrained(sealion_checkpoint, add_prefix_space=True)
sealion_tokenized = partial(tokenize_and_align_labels, tokenizer=sealion_tokenizer)

# Map tokenization function to dataset
sealion_train = ds['train'].map(sealion_tokenized, batched=True, remove_columns=ds['train'].column_names)
sealion_val = ds['validation'].map(sealion_tokenized, batched=True, remove_columns=ds['validation'].column_names)
sealion_test = ds['test'].map(sealion_tokenized, batched=True, remove_columns=ds['test'].column_names)

### Tokenization for `DistilBERT Multilingual`

In [9]:
distilbert_tokenizer = AutoTokenizer.from_pretrained(distilbert_checkpoint, add_prefix_space=True)
distilbert_tokenized = partial(tokenize_and_align_labels, tokenizer=distilbert_tokenizer)

distilbert_train = ds['train'].map(distilbert_tokenized, batched=True, remove_columns=ds['train'].column_names)
distilbert_val = ds['validation'].map(distilbert_tokenized, batched=True, remove_columns=ds['validation'].column_names)
distilbert_test = ds['test'].map(distilbert_tokenized, batched=True, remove_columns=ds['test'].column_names)

## Finetuning Models

Define Metrics Computation

In [10]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification, AutoModelForTokenClassification
import evaluate

# Load seqeval metric for NER evaluation
seqeval = evaluate.load("seqeval")

# Compute metrics function using seqeval
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(-1)
    
    # Drop positions labeled -100 (special tokens, padding, non-first subwords) and convert to label names
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    # Compute seqeval metrics
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
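The precision, recall, and F1 values returned above are computed over whole entity spans, while accuracy is computed per token. The simplified sketch below shows the difference on a single sentence. It is not seqeval's actual implementation, and the helper names `extract_entities` and `entity_prf` are made up for this example.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO-tagged sentence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" closes an open span
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if ent_type is not None and not continues:
            entities.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and ent_type is None):
            start, ent_type = i, tag[2:]
    return entities

def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall, and F1: a span counts only on an exact match."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

true_tags = ["O", "B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["O", "B-PER", "O", "O", "B-LOC"]  # the person span is cut short

token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
precision, recall, f1 = entity_prf(true_tags, pred_tags)
print(token_accuracy, precision, recall, f1)  # 0.8 0.5 0.5 0.5
```

A single wrong token costs the whole `PER` span, which is one reason token accuracy is typically much higher than entity-level F1 on NER tasks.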

### Finetuning `RoBERTa Tagalog` 

Load pretrained model for Token Classification

In [15]:
roberta_tl_model = AutoModelForTokenClassification.from_pretrained(
    roberta_tl_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

if device == "cuda":
    roberta_tl_model = roberta_tl_model.to('cuda')

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at jcblaise/roberta-tagalog-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define training arguments

In [16]:
# TODO: Understand the training arguments and the concepts of epochs, batch size, learning rate, etc.

# Data collator for token classification - this handles padding dynamically
data_collator = DataCollatorForTokenClassification(
    tokenizer=roberta_tl_tokenizer,
    padding=True,
    return_tensors="pt"
)

training_args = TrainingArguments(
    output_dir="./results/roberta-tagalog-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    greater_is_better=True,
    logging_dir="./logs",
    logging_steps=50,
    save_total_limit=1,
    seed=42,
    fp16=False,
    warmup_steps=100,
    dataloader_num_workers=0,  # Set to 0 to avoid multiprocessing issues
    remove_unused_columns=False,
    push_to_hub=False,
    dataloader_pin_memory=False,  # Set to False for M1
    gradient_accumulation_steps=4,
    do_eval=True,
)

# Initialize trainer
roberta_tl_trainer = Trainer(
    model=roberta_tl_model,
    args=training_args,
    train_dataset=roberta_tl_train,
    eval_dataset=roberta_tl_val,
    processing_class=roberta_tl_tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
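The training arguments above combine `per_device_train_batch_size=4` with `gradient_accumulation_steps=4`, so each optimizer update sees 16 examples per device. The sketch below works through that arithmetic. The function name `optimizer_steps` and the dataset size of 6,000 examples are placeholders for illustration (not the actual size of the training split), and `Trainer` may round the last partial accumulation step differently.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate effective batch size and total optimizer updates (hypothetical helper)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# 6000 is a placeholder dataset size, not the real size of the training split
effective_batch, total_steps = optimizer_steps(
    num_examples=6000, per_device_batch=4, grad_accum=4, epochs=3
)
print(effective_batch, total_steps)  # 16 1125

# Share of training spent in the linear warmup configured by warmup_steps=100
warmup_fraction = 100 / total_steps
print(f"{warmup_fraction:.1%}")
```

Gradient accumulation trades speed for memory: the model only ever holds a batch of 4 in memory, but the gradient that reaches the optimizer is averaged over 16 examples.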



Training `RoBERTa Tagalog` model 

In [52]:
# Start training
print("Starting training...")
roberta_tl_trainer.train()

# Save the model
print("Saving model...")
roberta_tl_trainer.save_model()
roberta_tl_tokenizer.save_pretrained("./results/roberta-tagalog-ner")
print("Training completed!")

Starting training...


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0668,0.057932,0.873722,0.901365,0.887328,0.982967
2,0.0494,0.058897,0.860737,0.912531,0.885878,0.98193
3,0.0371,0.057458,0.880767,0.911911,0.896068,0.983446


Saving model...
Training completed!


### Finetuning `DistilBERT Multilingual Cased`

In [17]:
distilbert_model = AutoModelForTokenClassification.from_pretrained(
    distilbert_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

if device == "cuda":
    distilbert_model = distilbert_model.to('cuda')

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
# Data collator for token classification - this handles padding dynamically
distilbert_data_collator = DataCollatorForTokenClassification(
    tokenizer=distilbert_tokenizer,
    padding=True, 
    return_tensors="pt"
)
   
training_args = TrainingArguments(
    output_dir="./results/distilbert-multilingual-cased-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    greater_is_better=True,
    logging_dir="./logs",
    logging_steps=50,
    save_total_limit=1,
    seed=42,
    fp16=False,
    warmup_steps=100,
    dataloader_num_workers=0,  # Set to 0 to avoid multiprocessing issues
    remove_unused_columns=False,
    push_to_hub=False,
    dataloader_pin_memory=False,  # Set to False for M1
    gradient_accumulation_steps=4,
    do_eval=True,
)

# Initialize trainer
distilbert_trainer = Trainer(
    model=distilbert_model,
    args=training_args,
    train_dataset=distilbert_train,
    eval_dataset=distilbert_val,
    processing_class=distilbert_tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=distilbert_data_collator,
    compute_metrics=compute_metrics,
)



In [55]:
# Start training
print("Starting training...")
distilbert_trainer.train()

# Save the model
print("Saving model...")
distilbert_trainer.save_model()
distilbert_tokenizer.save_pretrained("./results/distilbert-multilingual-cased-ner")
print("Training completed!")

Starting training...


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0833,0.069936,0.844484,0.89268,0.867913,0.980055
2,0.0587,0.061374,0.871437,0.891439,0.881325,0.98201
3,0.0425,0.060278,0.870138,0.901985,0.885775,0.982688


Saving model...
Training completed!


### Finetuning `SEA-LION`

Load pretrained model for Token Classification

In [None]:
sealion_model = AutoModelForTokenClassification.from_pretrained(
    sealion_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

if device == "cuda":
    sealion_model = sealion_model.to('cuda')

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]
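One practical consideration when loading a checkpoint of this size for full finetuning is memory. The rough sketch below assumes about 8 billion parameters (read off the "8B" in the checkpoint name), 32-bit weights and gradients (the training arguments in this notebook keep `fp16=False`), and an AdamW-style optimizer that stores two extra values per parameter. Activations and framework overhead are ignored, so the real requirement would be higher. The function name is made up for this example.

```python
def full_finetune_memory_gb(num_params, bytes_per_value=4):
    """Rough memory for weights, gradients, and two Adam moment buffers, in GB (10^9 bytes)."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_states = 2 * num_params * bytes_per_value  # first and second moment estimates
    return (weights + gradients + adam_states) / 1e9

# Assumption: roughly 8e9 parameters, based only on the "8B" in the checkpoint name
estimate_gb = full_finetune_memory_gb(8e9)
print(estimate_gb)  # 128.0
```

Under these assumptions the optimizer state alone exceeds the memory of a single consumer GPU, so parameter-efficient methods or lower-precision training would likely be needed to finetune this checkpoint.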

Define data collator and training arguments

In [None]:
# Data collator for token classification - this handles padding dynamically
sealion_data_collator = DataCollatorForTokenClassification(
    tokenizer=sealion_tokenizer,
    padding=True, 
    return_tensors="pt"
)
   
training_args = TrainingArguments(
    output_dir="./results/sealion-3.5-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    greater_is_better=True,
    logging_dir="./logs",
    logging_steps=50,
    save_total_limit=1,
    seed=42,
    fp16=False,
    warmup_steps=100,
    dataloader_num_workers=0,  # Set to 0 to avoid multiprocessing issues
    remove_unused_columns=False,
    push_to_hub=False,
    dataloader_pin_memory=False,  # Set to False for M1
    gradient_accumulation_steps=4,
    do_eval=True,
)

# Initialize trainer
sealion_trainer = Trainer(
    model=sealion_model,
    args=training_args,
    train_dataset=sealion_train,  # already the tokenized train split
    eval_dataset=sealion_val,  # already the tokenized validation split
    processing_class=sealion_tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=sealion_data_collator,
    compute_metrics=compute_metrics,
)

Train the `SEA-LION` model.

In [None]:
# Start training
print("Starting training...")
sealion_trainer.train()

# Save the model
print("Saving model...")
sealion_trainer.save_model()
sealion_tokenizer.save_pretrained("./results/sealion-3.5-ner")

print("Training completed!")

Starting training...


NameError: name 'sealion_trainer' is not defined


## Model Evaluation and Comparison

Create helper function to evaluate models

In [12]:
def evaluate_finetuned_model(model_path, trainer, test_dataset, label_list, id2label, label2id):
    print(f'Loading finetuned model from: {model_path}')
    
    # Load model and tokenizer
    finetuned_model = AutoModelForTokenClassification.from_pretrained(
        model_path,
        num_labels=len(label_list),
        id2label=id2label,
        label2id=label2id
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True)

    # Move the model to the best available device: CUDA first, then MPS, then CPU
    if torch.cuda.is_available():
        finetuned_model = finetuned_model.to('cuda')
    elif torch.backends.mps.is_available():
        finetuned_model = finetuned_model.to('mps')
    else:
        finetuned_model = finetuned_model.to('cpu')

    # Update trainer
    trainer.model = finetuned_model
    trainer.processing_class = tokenizer

    print("Making predictions...")
    prediction_output = trainer.predict(test_dataset)

    # Extract predictions and labels correctly
    predictions = prediction_output.predictions.argmax(-1)
    labels = prediction_output.label_ids

    # Convert IDs to label names, skipping positions labeled -100 (special tokens, padding, non-first subwords)
    true_predictions = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]

    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]

    print("Predictions complete!")
    
    return true_predictions, true_labels

Create helper function to display model metrics

In [13]:
import pandas as pd 

def print_predictions(true_predictions, true_labels, model_checkpoint):
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    total_results = 0

    # Extract entity-level results
    entity_data = []
    for entity in ['PER', 'ORG', 'LOC']:
        if entity in results:
            entity_data.append({
                'Entity': entity,
                'Precision': f"{results[entity]['precision']:.4f}",
                'Recall': f"{results[entity]['recall']:.4f}",
                'F1-Score': f"{results[entity]['f1']:.4f}",
                'Support': results[entity]['number']
            })
            total_results += results[entity]['number']

    # Add overall results
    entity_data.append({
        'Entity': 'OVERALL',
        'Precision': f"{results['overall_precision']:.4f}",
        'Recall': f"{results['overall_recall']:.4f}",
        'F1-Score': f"{results['overall_f1']:.4f}",
        'Support': total_results
    })

    # Create and display table
    df = pd.DataFrame(entity_data)
    print(f"Test Results for Model {model_checkpoint}:")
    print(df.to_string(index=False))
    print(f"\nOverall Accuracy: {results['overall_accuracy']:.4f}")

### `RoBERTa Tagalog` Prediction on Test Set 

In [19]:
true_predictions, true_labels = evaluate_finetuned_model(
    model_path='./results/roberta-tagalog-ner',
    trainer=roberta_tl_trainer,
    test_dataset=roberta_tl_test,
    label_list=label_list,
    id2label=id2label,
    label2id=label2id
)

Loading finetuned model from: ./results/roberta-tagalog-ner
Making predictions...


Predictions complete!


In [20]:
print_predictions(true_predictions, true_labels, model_checkpoint='RoBERTa Tagalog')

Test Results for Model RoBERTa Tagalog:
 Entity Precision Recall F1-Score Support
    PER    0.9314 0.9292   0.9303     833
    ORG    0.7625 0.8402   0.7995     363
    LOC    0.8502 0.9191   0.8833     383
OVERALL    0.8699 0.9063   0.8877    1579

Overall Accuracy: 0.9823


### `DistilBERT Multilingual Cased` Prediction on Test Set

In [21]:
true_predictions, true_labels = evaluate_finetuned_model(
    model_path='./results/distilbert-multilingual-cased-ner',
    trainer=distilbert_trainer,
    test_dataset=distilbert_test,
    label_list=label_list,
    id2label=id2label,
    label2id=label2id
)

Loading finetuned model from: ./results/distilbert-multilingual-cased-ner
Making predictions...


Predictions complete!


In [22]:
print_predictions(true_predictions, true_labels, model_checkpoint='DistilBERT Multilingual Cased')

Test Results for Model DistilBERT Multilingual Cased:
 Entity Precision Recall F1-Score Support
    PER    0.9158 0.9136   0.9147     833
    ORG    0.7214 0.8347   0.7739     363
    LOC    0.8700 0.9086   0.8889     383
OVERALL    0.8552 0.8942   0.8743    1579

Overall Accuracy: 0.9804
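As a sanity check on the two result tables, the `OVERALL` recall can be reconstructed from the per-entity rows, since it is a micro-average: correctly recovered entities summed over all types, divided by total support. The sketch below recovers the true-positive counts by rounding `recall * support`, which is an approximation because the printed recalls are already rounded to four decimals. The helper name `micro_recall` is made up for this example.

```python
def micro_recall(per_entity):
    """Micro-averaged recall from {entity: (recall, support)} rows (hypothetical helper)."""
    true_positives = sum(round(recall * support) for recall, support in per_entity.values())
    total_support = sum(support for _, support in per_entity.values())
    return true_positives / total_support

# (recall, support) pairs copied from the test-set tables above
roberta_rows = {"PER": (0.9292, 833), "ORG": (0.8402, 363), "LOC": (0.9191, 383)}
distilbert_rows = {"PER": (0.9136, 833), "ORG": (0.8347, 363), "LOC": (0.9086, 383)}

roberta_recall = micro_recall(roberta_rows)
distilbert_recall = micro_recall(distilbert_rows)
print(f"RoBERTa Tagalog: {roberta_recall:.4f}")        # 0.9063
print(f"DistilBERT Multilingual: {distilbert_recall:.4f}")  # 0.8942
```

Both values match the `OVERALL` rows, and they show why the overall score sits closer to the `PER` score than to `ORG` or `LOC`: `PER` accounts for 833 of the 1579 test entities.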
