<a href="https://colab.research.google.com/github/xandreiAThome/machine-translation-nlp1k/blob/main/nmt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Machine Translation

## Preprocess

Load the aligned verses from the tsv, clean the string from any non alphabetic characters. Remove any verses that have no verse for either of the two language, and use the class from the datasets library to structure the data and be ready for training.

In [1]:
import regex as re

def clean_string(input_string):
    cleaned = re.sub(r"[^\p{L}\s]", "", input_string.strip().lower())
    return cleaned

def process(example):
    src = example["src"].strip()
    tgt = example["tgt"].strip()

    # skip invalid pairs
    if src.lower() == "<no verse>" or tgt.lower() == "<no verse>":
        return {"src": None, "tgt": None}

    return {
        "src": clean_string(src),
        "tgt": clean_string(tgt),
    }

In [2]:
# LANGUAGE CONFIGURATION (also the name of the columns in the dataset)
SRC_LANG = "Pangasinan"
TGT_LANG = "Bikolano"

In [3]:
!ls /kaggle/input

bikolano-pangasinan-parallel  bikolano-tagalog-parallel


In [4]:
from datasets import load_dataset

# DATASET CONFIGURATION 
DATASET_PATH = "/kaggle/input/bikolano-pangasinan-parallel/Bikolano_Pangasinan_Parallel.tsv"
DATASET_DELIMITER = "\t"
DATASET_SPLIT = "train"

dataset = load_dataset(
    "csv",
    data_files=DATASET_PATH,
    delimiter=DATASET_DELIMITER,
)

dataset = dataset[DATASET_SPLIT].select_columns([SRC_LANG, TGT_LANG])
dataset = dataset.rename_columns({SRC_LANG: "src", TGT_LANG: "tgt"})
initial_dataset_length = len(dataset)

dataset = dataset.map(process)

dataset = dataset.filter(lambda x: x["src"] is not None and x["tgt"] is not None)

skipped = initial_dataset_length - len(dataset)
print(f"skipped verses: {skipped}")

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/30028 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30028 [00:00<?, ? examples/s]

skipped verses: 41


Lets look at the first 5 aligned verses

In [5]:
display(dataset[:5])

{'src': ['si adan so ama nen set ya ama nen enos tan si enos so ama nen kenan',
  'si kenan so ama nen mahalalel ya ama nen jared',
  'tan si jared so ama nen enoc ya ama nen matusalem si matusalem so ama nen lamec',
  'ya ama nen noe taloran lalaki so anak nen noe di sem ham tan jafet',
  'saray lalakin anak nen jafet sikara di gomer magog madai javan tubal mesec tan tiras'],
 'tgt': ['si adan iyo an ama ni set asin si set iyo an ama ni enos na ama ni kenan',
  'si kenan iyo an ama ni mahalalel na ama ni jared',
  'si jared iyo an ama ni enoc na ama ni metusela si metusela iyo an ama ni lamec',
  'na iyo an ama ni noe si noe nagkaigwa nin tolong aking lalaki na iyo si sem ham asin si jafet',
  'an mga aking lalaki ni jafet iyo si gomer magog madai javan tubal mesec asin tiras']}

## Setting up Trainer
We will use facebook's No Language Left Behind Model as the base model to fine tune using our dataset. It is performant even on low resource languages thats why our group decided to use it.

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# MODEL CONFIGURATION 
BASE_MODEL_NAME = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL_NAME)

tokenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

2025-11-17 23:43:50.350680: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763423030.529559      48 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763423030.578368      48 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

pytorch_model.bin:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

In [7]:
# TOKENIZATION CONFIGURATION 
MAX_LENGTH = 128

def tokenize(batch):
    model_inputs = tokenizer(batch["src"], truncation=True, max_length=MAX_LENGTH)
    labels = tokenizer(batch["tgt"], truncation=True, max_length=MAX_LENGTH).input_ids
    model_inputs["labels"] = labels
    return model_inputs

tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/29987 [00:00<?, ? examples/s]

Let us split the training data to also have a dataset for evaluation after training.

In [8]:
split = tokenized_dataset.train_test_split(test_size=0.1)
train_data = split["train"]
eval_data = split["test"]
# TRAINING CONFIGURATION 
RUN_NAME = "nllb-pag-bcl"
OUTPUT_PATH = f"/kaggle/tmp/{RUN_NAME}"

In [10]:
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer

# TRAINING HYPERPARAMETERS 
BATCH_SIZE = 4
LEARNING_RATE = 5e-5
NUM_EPOCHS = 6
LOGGING_STEPS = 50
GRADIENT_ACCUMULATION_STEPS = 2  # effective batch size = 8
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIMIT = 2
USE_FP16 = True

training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_PATH,
    run_name=RUN_NAME,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    num_train_epochs=NUM_EPOCHS,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=LOGGING_STEPS,
    fp16=USE_FP16,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    weight_decay=WEIGHT_DECAY,
    predict_with_generate=True,
    save_total_limit=SAVE_TOTAL_LIMIT,
    report_to=[],
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

  trainer = Seq2SeqTrainer(


In [11]:
import torch
print("CUDA available?", torch.cuda.is_available())
print("Device:", torch.cuda.current_device())
print("Device name:", torch.cuda.get_device_name(torch.cuda.current_device()))

CUDA available? True
Device: 0
Device name: Tesla P100-PCIE-16GB


In [12]:
print("starting training")
trainer.train()
trainer.save_model(OUTPUT_PATH)
tokenizer.save_pretrained(OUTPUT_PATH)


starting training


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,1.6112,1.49489
2,1.367,1.332612
3,1.163,1.265801
4,1.0715,1.234605
5,0.9781,1.222778
6,0.9407,1.224769




('/kaggle/tmp/nllb-pag-bcl/tokenizer_config.json',
 '/kaggle/tmp/nllb-pag-bcl/special_tokens_map.json',
 '/kaggle/tmp/nllb-pag-bcl/sentencepiece.bpe.model',
 '/kaggle/tmp/nllb-pag-bcl/added_tokens.json',
 '/kaggle/tmp/nllb-pag-bcl/tokenizer.json')

In [13]:
!ls /kaggle/tmp

nllb-pag-bcl


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [15]:
!zip -r /kaggle/working/nllb-pag-bcl.zip /kaggle/tmp/nllb-pag-bcl

  adding: kaggle/tmp/nllb-pag-bcl/ (stored 0%)
  adding: kaggle/tmp/nllb-pag-bcl/generation_config.json (deflated 34%)
  adding: kaggle/tmp/nllb-pag-bcl/special_tokens_map.json (deflated 79%)
  adding: kaggle/tmp/nllb-pag-bcl/training_args.bin (deflated 52%)
  adding: kaggle/tmp/nllb-pag-bcl/config.json (deflated 57%)
  adding: kaggle/tmp/nllb-pag-bcl/tokenizer.json

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


 (deflated 82%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/ (stored 0%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/trainer_state.json (deflated 78%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/generation_config.json (deflated 34%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/special_tokens_map.json (deflated 79%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/optimizer.pt (deflated 8%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/rng_state.pth (deflated 25%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/training_args.bin (deflated 52%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/config.json (deflated 57%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/tokenizer.json (deflated 82%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/sentencepiece.bpe.model (deflated 51%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/tokenizer_config.json (deflated 94%)
  adding: kaggle/tmp/nllb-pag-bcl/checkpoint-20244/scheduler.pt (deflated 55%

In [16]:
 %cd /kaggle/working

/kaggle/working


In [17]:
from IPython.display import FileLink
FileLink(r'nllb-pag-bcl.zip')

## Evaluate Model on Bikolano to Tagalog Translation

Load the trained checkpoint and evaluate its translation quality on the dataset.

In [18]:
# Load the trained model and tokenizer from checkpoint
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# CHECKPOINT CONFIGURATION 
CHECKPOINT_PATH = "/kaggle/tmp/nllb-pag-bcl"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT_PATH)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

print(f"Using device: {device}")

Using device: cuda


In [19]:
from datasets import load_dataset

# ===== TRANSLATION FUNCTION =====
def translate(text, model_tokenizer, translation_model, src_lang=SRC_LANG, tgt_lang=TGT_LANG):
    # Tokenize input text
    inputs = model_tokenizer(text, return_tensors="pt", max_length=MAX_LENGTH, truncation=True).to(device)
    
    # Generate translation
    with torch.no_grad():
        outputs = translation_model.generate(
            **inputs,
            max_length=GENERATION_MAX_LENGTH,
            num_beams=NUM_BEAMS,
            early_stopping=True
        )
    
    translation = model_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

# ===== EVALUATION DATASET CONFIGURATION =====
EVAL_DATASET_PATH = "/kaggle/input/bikolano-pangasinan-parallel/Bikolano_Pangasinan_Parallel.tsv"

# Load the original Bikolano-Tagalog dataset
dataset = load_dataset(
    "csv",
    data_files=EVAL_DATASET_PATH,
    delimiter="\t",
)

dataset = dataset["train"].select_columns(["Pangasinan", "Bikolano"])
dataset = dataset.rename_columns({"Pangasinan": "src", "Bikolano": "tgt"})

# Apply the same cleaning function as before
dataset = dataset.map(process)
dataset = dataset.filter(lambda x: x["src"] is not None and x["tgt"] is not None)

print(f"Total dataset size: {len(dataset)}")

Total dataset size: 29987


In [21]:
%pip install sacrebleu

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading portalocker-3.2.0-py3-none-any.whl (22 kB)
Installing collected packages: portalocker, sacrebleu
Successfully installed portalocker-3.2.0 sacrebleu-2.5.1
Note: you may need to restart the kernel to use updated packages.


In [22]:
# Evaluate on test set
import sacrebleu
from tqdm import tqdm
import numpy as np

# ===== EVALUATION CONFIGURATION =====
EVAL_SIZE = 100
GENERATION_MAX_LENGTH = 128
NUM_BEAMS = 5

# Get a sample from the dataset for evaluation
eval_size = min(EVAL_SIZE, len(dataset))
eval_dataset = dataset.select(range(eval_size))

predictions = []
references = []

print("Generating translations for evaluation...")
for i, example in enumerate(tqdm(eval_dataset, total=eval_size)):
    src_text = example["src"]
    ref_text = example["tgt"]
    
    pred_text = translate(src_text, tokenizer, model, SRC_LANG, TGT_LANG)
    
    predictions.append(pred_text)
    references.append(ref_text)

def calculate_bleu(predictions, references):
    """Calculate corpus BLEU score"""
    # sacrebleu expects predictions as list of strings and references as list of list of strings
    refs = [[ref] for ref in references]
    return sacrebleu.corpus_bleu(predictions, refs)

bleu_score = calculate_bleu(predictions, references)
print(f"\nBLEU Score: {bleu_score.score:.4f}")

Generating translations for evaluation...


100%|██████████| 100/100 [00:39<00:00,  2.51it/s]


BLEU Score: 85.0733





## Compare with Base NLLB Model

Let us evaluate the base NLLB model on the same test set to compare performance.
Note that this can only be used when translating to Tagalog as the NLBB model only trained is only trained in Tagalog among the Philippine Languages. But is also trained in other South East Asian Languages among the 200 language dataset.

In [23]:
# Load base NLLB model for comparison
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

print("Loading base NLLB model...")
base_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
base_model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL_NAME)

base_model = base_model.to(device)
base_model.eval()

print(f"Base model loaded: {BASE_MODEL_NAME}")

def translate_base_model(text, target_lang_code="tgl_Latn", model_tokenizer=None, multilingual_model=None):
    """ 
    Common language codes:
        - tgl_Latn: Tagalog
        - eng_Latn: English
        - spa_Latn: Spanish
        - fra_Latn: French
        - deu_Latn: German
        - cmn_Hans: Mandarin Chinese
        - jpn_Jpan: Japanese
    """
    if model_tokenizer is None:
        model_tokenizer = base_tokenizer
    if multilingual_model is None:
        multilingual_model = base_model
    
    inputs = model_tokenizer(text, return_tensors="pt", max_length=MAX_LENGTH, truncation=True).to(device)
    
    # Force the target language
    forced_bos_token_id = model_tokenizer.convert_tokens_to_ids(target_lang_code)
    
    with torch.no_grad():
        outputs = multilingual_model.generate(
            **inputs,
            max_length=GENERATION_MAX_LENGTH,
            num_beams=NUM_BEAMS,
            early_stopping=True,
            forced_bos_token_id=forced_bos_token_id
        )
    
    translation = model_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

# Generate predictions with base model
print("\nGenerating translations with base model...")
base_predictions = []

for example in tqdm(eval_dataset, total=eval_size, desc="Base model"):
    pred_text = translate_base_model(example["src"])
    base_predictions.append(pred_text)

# Calculate BLEU score for base model
base_bleu_score = calculate_bleu(base_predictions, references)

print("\n" + "=" * 80)
print("BLEU SCORE COMPARISON")
print("=" * 80)
print(f"Base NLLB Model:        {base_bleu_score.score:.4f}")
print(f"Fine-tuned Model:       {bleu_score.score:.4f}")
print(f"Improvement:            {bleu_score.score - base_bleu_score.score:+.4f}")
print("=" * 80)

Loading base NLLB model...
Base model loaded: facebook/nllb-200-distilled-600M

Generating translations with base model...


Base model: 100%|██████████| 100/100 [00:40<00:00,  2.47it/s]


BLEU SCORE COMPARISON
Base NLLB Model:        2.4075
Fine-tuned Model:       85.0733
Improvement:            +82.6658





In [24]:

import pandas as pd

comparison_df = pd.DataFrame({
    "Source (Pangasinan)": [eval_dataset[i]['src'] for i in range(len(eval_dataset))],
    "Reference (Bikolano)": references,
    "Base Model Output": base_predictions,
    "Fine-tuned Model Output": predictions
})

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("FULL TRANSLATION COMPARISON")
display(comparison_df.head(10))

comparison_df.to_csv("translation_comparison.csv", index=False)
print("\nComparison saved to: translation_comparison.csv")

FULL TRANSLATION COMPARISON


Unnamed: 0,Source (Pangasinan),Reference (Bikolano),Base Model Output,Fine-tuned Model Output
0,si adan so ama nen set ya ama nen enos tan si enos so ama nen kenan,si adan iyo an ama ni set asin si set iyo an ama ni enos na ama ni kenan,Si Adan so in love nen set you si Adan so in love nen Nan Nan Nan Nan Nan,si adan iyo an ama ni set na ama ni enos na ama ni kenan
1,si kenan so ama nen mahalalel ya ama nen jared,si kenan iyo an ama ni mahalalel na ama ni jared,Si kenan so ama nen mahalalel ya ama nen jared,si kenan iyo an ama ni mahalalel na ama ni jared
2,tan si jared so ama nen enoc ya ama nen matusalem si matusalem so ama nen lamec,si jared iyo an ama ni enoc na ama ni metusela si metusela iyo an ama ni lamec,Tan si jared kaya mahal ko si enoc kaya mahal ko si matusalem si matusalem kaya mahal ko si lamec,si jared iyo an ama ni enoc na ama ni metusela si metusela iyo an ama ni lamec
3,ya ama nen noe taloran lalaki so anak nen noe di sem ham tan jafet,na iyo an ama ni noe si noe nagkaigwa nin tolong aking lalaki na iyo si sem ham asin si jafet,I love no taloran lalaki kaya anak no no sa sem ham tan jafet,na ama ni noe nagkaigwa si noe nin tolong aking lalaki si sem si ham asin si jafet
4,saray lalakin anak nen jafet sikara di gomer magog madai javan tubal mesec tan tiras,an mga aking lalaki ni jafet iyo si gomer magog madai javan tubal mesec asin tiras,saray lalakin anak nen jafet sikara di gomer magog madai javan tubal buwan tan tiras,an mga aking lalaki ni jafet iyo si gomer magog madai javan tubal mesek asin tiras
5,saray lalakin anak nen gomer sikara di askenaz rifat tan togarma,an mga aking lalaki ni gomer iyo si askenaz rifat asin togarma,saray lalakin anak nen gomer sikara di askenaz rifat tan togarma,an mga aking lalaki ni gomer iyo si askenaz rifat asin togarma
6,saray lalakin anak nen javan sikara di elisa espanya chipre tan dodanim,an mga aking lalaki ni javan iyo si elisa espanya chipre asin rodas,saray lalakin anak nen javan sikara di elisa spanish chipre tan dodanim,an mga aking lalaki ni javan iyo si elisa espanya chipre asin dodanim
7,saray lalakin anak nen ham sikara di cus egipto libia tan canaan,an mga aking lalaki ni ham iyo si cus egipto libya asin canaan,Saray lalakin anak nen ham sikara di cus Ehipto Libya at Canaan,an mga aking lalaki ni ham iyo si cus egipto libya asin canaan
8,saray lalakin anak nen cus sikara di seba havila sabta raama tan sabteca saray lalakin anak nen raama sikara di saba tan dedan,an mga aking lalaki ni cus iyo si seba havila sabta raama asin sabteca an mga aking lalaki ni raama iyo an mga tawo sa sheba asin dedan,Saray lalakin anak nen cus sikara di seba havila sabta raama tan sabteca saray lalakin anak nen raama sikara di saba tan dedan,an mga aking lalaki ni cus iyo si sheba havila sabta raama asin sabteca an mga aking lalaki ni raama iyo si sheba asin dedan
9,walay lakin anak nen cus a manngaran na nimrod a sikatoy inmonan bantug a makapanyari diad tapew na dalin,si cus iyo an ama ni nimrod nagpoon ining mabantog bilang sarong mapangyaring mananakop kan kinaban,wala lakin anak nen cus isang manngaran na nimrod isang sikatoy inmonan bantug isang makapanyari diad tapew na dalin,si cus iyo an ama ni nimrod na iyo an enot na bantog na nagin mapangyari sa kinaban



Comparison saved to: translation_comparison.csv
