
# Neural Machine Translation

## Preprocess

Load the aligned verses from the TSV file and strip every character that is not a letter or whitespace. Drop any pair in which either language is missing its verse, then use the `datasets` library to structure the data so it is ready for training.

In [1]:
import regex as re

def clean_string(input_string):
    cleaned = re.sub(r"[^\p{L}\s]", "", input_string.strip().lower())
    return cleaned

def process(example):
    src = example["src"].strip()
    tgt = example["tgt"].strip()

    # skip invalid pairs
    if src.lower() == "<no verse>" or tgt.lower() == "<no verse>":
        return {"src": None, "tgt": None}

    return {
        "src": clean_string(src),
        "tgt": clean_string(tgt),
    }
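
As a quick check of what the cleaning rule keeps and drops, here is a minimal sketch that mirrors `clean_string` using only the standard library. The name `clean_string_stdlib` and the sample string are illustrative assumptions, not part of the notebook; Unicode categories starting with `L` are used as a stand-in for the `\p{L}` class, and `str.isspace` as an approximation of `\s`.

```python
import unicodedata

def clean_string_stdlib(text):
    """Hypothetical stdlib analogue of clean_string: keep letters and whitespace only."""
    text = text.strip().lower()
    return "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch).startswith("L")
    )

sample = "  Si Adan, iyo an ama ni Set; 1:1 "
cleaned = clean_string_stdlib(sample)
print(repr(cleaned))  # 'si adan iyo an ama ni set '
```

Note the trailing space in the result: stripping happens before the non-letter characters are removed, so whitespace that surrounded a removed token survives. Removed punctuation between words can leave double spaces for the same reason.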

In [2]:
# Set these to the matching column names in the TSV
src_lang = "Bikolano"
target_lang = "Tagalog"

In [3]:
!ls /kaggle/input

bikolano-tagalog-parallel


In [4]:
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files="/kaggle/input/bikolano-tagalog-parallel/Bikolano_Tagalog_Parallel.tsv",
    delimiter="\t",
)

dataset = dataset["train"].select_columns([src_lang, target_lang])
dataset = dataset.rename_columns({src_lang: "src", target_lang: "tgt"})

# Get initial dataset length
initial_dataset_length = len(dataset)

dataset = dataset.map(process)

# remove rows with None (invalid)
dataset = dataset.filter(lambda x: x["src"] is not None and x["tgt"] is not None)

# Calculate skipped verses
skipped = initial_dataset_length - len(dataset)
print(f"skipped verses: {skipped}")

skipped verses: 27
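
The map-then-filter flow above can be traced on a few toy rows in plain Python. This is a sketch under assumptions: the rows are invented, `process_row` is a hypothetical stand-in for `process` that only strips and lowercases, and the stricter empty-string check is a suggestion, not something the notebook does. It is shown because a verse made up only of digits or punctuation would become an empty string after cleaning and would still pass the `is not None` filter.

```python
NO_VERSE = "<no verse>"

def process_row(row):
    """Hypothetical stand-in for process(): mark placeholder pairs as invalid."""
    src, tgt = row["src"].strip(), row["tgt"].strip()
    if src.lower() == NO_VERSE or tgt.lower() == NO_VERSE:
        return {"src": None, "tgt": None}
    return {"src": src.lower(), "tgt": tgt.lower()}

rows = [
    {"src": "Si Adan", "tgt": "Sina Adan"},
    {"src": "<no verse>", "tgt": "Kenan"},
    {"src": "   ", "tgt": "Enoc"},  # becomes empty after stripping
]

mapped = [process_row(r) for r in rows]
# Same condition as the notebook's filter
kept = [r for r in mapped if r["src"] is not None and r["tgt"] is not None]
# Optional stricter filter (assumption): also drop pairs with an empty side
kept_strict = [r for r in kept if r["src"] and r["tgt"]]

print("skipped:", len(rows) - len(kept))
print("skipped (strict):", len(rows) - len(kept_strict))
```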


Let's look at the first five aligned verses.

In [5]:
display(dataset[:5])

{'src': ['si adan iyo an ama ni set asin si set iyo an ama ni enos na ama ni kenan',
  'si kenan iyo an ama ni mahalalel na ama ni jared',
  'si jared iyo an ama ni enoc na ama ni metusela si metusela iyo an ama ni lamec',
  'na iyo an ama ni noe si noe nagkaigwa nin tolong aking lalaki na iyo si sem ham asin si jafet',
  'an mga aking lalaki ni jafet iyo si gomer magog madai javan tubal mesec asin tiras'],
 'tgt': ['sina adan set enos',
  'kenan mahalalel jared',
  'enoc matusalem lamec',
  'noe sem ham at jafet',
  'ang mga anak ni jafet ay sina gomer magog madai javan tubal meshec at tiras']}

## Setting up the Trainer

We will use Facebook's No Language Left Behind (NLLB) model as the base model and fine-tune it on our dataset. Our group chose it because it performs well even on low-resource languages.

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

2025-11-15 11:41:51.226663: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763206911.408539      48 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763206911.454820      48 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

In [7]:
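# Tagalog is tgl_Latn in NLLB; set it so the labels carry the target-language tag.
# The source language is left at the tokenizer's default.
tokenizer.tgt_lang = "tgl_Latn"

def tokenize(batch):
    # text_target tokenizes the targets in target mode and stores them under "labels"
    return tokenizer(batch["src"], text_target=batch["tgt"], truncation=True, max_length=128)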

tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/33873 [00:00<?, ? examples/s]

Split the tokenized data 90/10 so that a held-out set is available for evaluation at the end of each epoch.

In [8]:
split = tokenized_dataset.train_test_split(test_size=0.1)
train_data = split["train"]
eval_data = split["test"]
# Rename to match the selected language pair
run_name = "nllb-bcl-tgl"
output_path = f"/kaggle/tmp/{run_name}"
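
`train_test_split` shuffles before splitting, so each run produces a different evaluation set unless a `seed` is passed. The idea can be sketched in plain Python. `split_indices` and the seed value 42 are illustrative assumptions, and the rounding here (test size rounded up) is meant to illustrate a 90/10 split, not to reproduce the library's exact behavior.

```python
import math
import random

def split_indices(n_rows, test_size=0.1, seed=42):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut off a test slice."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(33873)
print(len(train_idx), len(test_idx))  # 30485 3388
```

In the notebook itself, the equivalent change would be passing `seed=42` (or any fixed value) to `train_test_split`.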

In [9]:
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir=output_path,
    run_name=run_name,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=6,
    eval_strategy="epoch", # Changed from evaluate_during_training
    save_strategy="epoch",
    logging_steps=50,
    fp16=True,
    gradient_accumulation_steps=2,  # effective batch size = 8
    weight_decay=0.01,
    predict_with_generate=True,
    save_total_limit=2,
    report_to=[],
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument
    data_collator=data_collator,
)
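
`DataCollatorForSeq2Seq` pads each batch to its longest sequence and, by default, pads `labels` with `-100` so that padded positions are ignored by the loss. The sketch below illustrates that padding in plain Python. `pad_batch` and the token ids are made up for illustration, and the real collator does more (for example, preparing decoder inputs when a model is passed).

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    """Hypothetical helper: right-pad input_ids and labels to the batch maximum."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_lab - n_lab))
    return batch

# Illustrative token ids only
features = [
    {"input_ids": [5, 6, 7, 2], "labels": [9, 2]},
    {"input_ids": [8, 2], "labels": [4, 3, 2]},
]
batch = pad_batch(features, pad_token_id=1)
print(batch["labels"])  # [[9, 2, -100], [4, 3, 2]]
```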


In [10]:
import torch
print("CUDA available?", torch.cuda.is_available())
print("Device:", torch.cuda.current_device())
print("Device name:", torch.cuda.get_device_name(torch.cuda.current_device()))

CUDA available? True
Device: 0
Device name: Tesla P100-PCIE-16GB
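
Before launching a long run, it can help to estimate how many optimizer steps the training arguments imply. The arithmetic below is a rough sketch: it assumes a single GPU, a training set of about 30,485 rows (roughly 90% of the 33,873 cleaned verses), and simple rounding up. The exact step count reported by the trainer may differ slightly.

```python
import math

n_train = 30485        # assumed: ~90% of the 33,873 cleaned verses
per_device_batch = 4   # per_device_train_batch_size
grad_accum = 2         # gradient_accumulation_steps
epochs = 6             # num_train_epochs

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(n_train / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # 8 3811 22866
```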


In [None]:
print("starting training")
trainer.train()
trainer.save_model(output_path)
tokenizer.save_pretrained(output_path)


starting training


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss


In [None]:
!zip -r /kaggle/working/nllb-bcl-tgl.zip /kaggle/tmp/nllb-bcl-tgl
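
The same archive can also be created from Python with `shutil.make_archive`, which avoids depending on the `zip` binary. This is a sketch and not the notebook's method: `archive_run` is a hypothetical helper, and the demo runs on a temporary directory instead of the Kaggle paths.

```python
import os
import shutil
import tempfile

def archive_run(model_dir, archive_base):
    """Hypothetical helper: zip model_dir into archive_base + '.zip' and return the path."""
    return shutil.make_archive(archive_base, "zip", root_dir=model_dir)

# Demo on a temporary directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "nllb-bcl-tgl")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")
    archive_path = archive_run(model_dir, os.path.join(tmp, "nllb-bcl-tgl"))
    archive_exists = os.path.exists(archive_path)
    print(os.path.basename(archive_path), archive_exists)
```

On Kaggle, the call would presumably look like `archive_run(output_path, f"/kaggle/working/{run_name}")`, using the variables defined earlier in the notebook.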