### Machine Translation from English to isiXhosa

>A language model predicts the probability of the next word in a sequence, based on the words already observed in that sequence.
>In this project, we will use the Xhosa dialogue dataset to develop a statistical language model that generates new text with statistical properties similar to those of everyday Xhosa conversation.
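
To make the idea of next-word prediction concrete, here is a minimal sketch of a bigram model in plain Python. It is not the model used later in this notebook; the function names and the three-sentence toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Return P(next | prev) as a dict of maximum-likelihood estimates."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {word: c / total for word, c in counts[prev].items()}

# Toy corpus, invented for illustration (not taken from the dataset)
toy_corpus = [
    "umntwana udlala phandle",
    "umntwana udlala ibhola",
    "umntwana ulala phandle",
]
model = train_bigram_model(toy_corpus)
probs = next_word_probs(model, "umntwana")
print(probs)
```

Here "udlala" follows "umntwana" in two of the three sentences, so it receives two thirds of the probability mass.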

### Dataset Preparation
> We will start by preparing the data for modeling.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load dataset from https://hdl.handle.net/20.500.12185/525

# Download English file
!wget -O english.txt "https://repo.sadilar.org/bitstream/20.500.12185/525/2/Corpus.SADiLaR.English-isiXhosaDrop-Bilingual.1.0.0.CAM.2019-11-15.en.txt"

# Download isiXhosa file
!wget -O xhosa.txt "https://repo.sadilar.org/bitstream/20.500.12185/525/1/Corpus.SADiLaR.English-isiXhosaDrop-Bilingual.1.0.0.CAM.2019-11-15.xh.txt"


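import csv
import pandas as pd
# One sentence per line: disable quote handling so quotation marks in the text are read literally
df_english = pd.read_csv('english.txt', sep='\t', header=None, names=['english_text'], quoting=csv.QUOTE_NONE)
df_xhosa = pd.read_csv('xhosa.txt', sep='\t', header=None, names=['xhosa_text'], quoting=csv.QUOTE_NONE)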


print(df_english.head())
print(df_xhosa.head())

--2025-05-17 17:29:51--  https://repo.sadilar.org/bitstream/20.500.12185/525/2/Corpus.SADiLaR.English-isiXhosaDrop-Bilingual.1.0.0.CAM.2019-11-15.en.txt
Resolving repo.sadilar.org (repo.sadilar.org)... 143.160.47.167
Connecting to repo.sadilar.org (repo.sadilar.org)|143.160.47.167|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://repo.sadilar.org/bitstreams/d3663756-58a1-4a53-8333-cb54ea3619d1/download [following]
--2025-05-17 17:29:53--  https://repo.sadilar.org/bitstreams/d3663756-58a1-4a53-8333-cb54ea3619d1/download
Reusing existing connection to repo.sadilar.org:443.
HTTP request sent, awaiting response... 302 Found
Location: https://repo.sadilar.org/server/api/core/bitstreams/d3663756-58a1-4a53-8333-cb54ea3619d1/content [following]
--2025-05-17 17:29:53--  https://repo.sadilar.org/server/api/core/bitstreams/d3663756-58a1-4a53-8333-cb54ea3619d1/content
Reusing existing connection to repo.sadilar.org:443.
HTTP request sent, awaiting r

In [None]:
import os

project_folder = '/content/drive/My Drive/machine_translation_xhosa'
os.makedirs(project_folder, exist_ok=True)

!cp english.txt "{project_folder}/english.txt"
!cp xhosa.txt "{project_folder}/xhosa.txt"

In [None]:
%cd /content/drive/My Drive/machine_translation_xhosa
%ls

/content/drive/My Drive/machine_translation_xhosa
english.txt  xhosa.txt


In [None]:
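# Managing GitHub repo
from getpass import getpass

username = "szinja"
# Never hard-code a personal access token in a notebook; prompt for it at runtime instead
token = getpass("GitHub personal access token: ")
repo = "language-translation"

!git clone https://{token}@github.com/{username}/{repo}.git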

Cloning into 'language-translation'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 17 (delta 3), reused 14 (delta 3), pack-reused 0 (from 0)
Receiving objects: 100% (17/17), 5.77 KiB | 1.44 MiB/s, done.
Resolving deltas: 100% (3/3), done.


In [None]:
# Move the language files into the new directory
!mv english.txt language-translation/
!mv xhosa.txt language-translation/

# Change into the language-translation directory and list files to verify
%cd language-translation
%ls
%cd ..

/content/drive/MyDrive/machine_translation_xhosa/language-translation
english.txt   README.md         results.ipynb  streamlit_app.py  xhosa.txt
inference.py  requirements.txt  setup.sh       train.py
/content/drive/MyDrive/machine_translation_xhosa


In [None]:
!pip install transformers datasets sentencepiece sacrebleu wandb

Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.1.1-py3-none-any.whl.metadata (8.6 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.1.1-py3-none-any.whl (19 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-3.1.1 sacrebleu-2.5.1


In [None]:
import os
# Git synced directory
%cd /content/drive/MyDrive/machine_translation_xhosa/language-translation

# Load files
with open("english.txt", encoding="utf-8") as f:
    en_lines = [line.strip() for line in f.readlines()]

with open("xhosa.txt", encoding="utf-8") as f:
    xh_lines = [line.strip() for line in f.readlines()]

# Drop pairs where either side is empty or has 150 or more words
parallel = [(en, xh) for en, xh in zip(en_lines, xh_lines) if en and xh and len(en.split()) < 150 and len(xh.split()) < 150]

# Split into train/val/test (80/10/10)
from sklearn.model_selection import train_test_split

train, test = train_test_split(parallel, test_size=0.2, random_state=42)
val, test = train_test_split(test, test_size=0.5, random_state=42)

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")


/content/drive/MyDrive/machine_translation_xhosa/language-translation
Train: 101356, Val: 12669, Test: 12670
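
The filter above keeps any pair whose two sides are non-empty and short enough, but `zip` will silently pair up misaligned lines if the two files ever drift apart. A minimal sketch of a stricter filter follows; `filter_parallel` and the `max_ratio=3.0` threshold are assumptions for illustration, not part of the original pipeline. Because isiXhosa is agglutinative and tends to use fewer whitespace-separated words than English, the ratio would need tuning on real data.

```python
def filter_parallel(en_lines, xh_lines, max_words=150, max_ratio=3.0):
    """Keep aligned pairs that are non-empty, short enough, and of comparable length."""
    if len(en_lines) != len(xh_lines):
        raise ValueError("Files have different line counts; alignment is unreliable.")
    kept = []
    for en, xh in zip(en_lines, xh_lines):
        n_en, n_xh = len(en.split()), len(xh.split())
        if n_en == 0 or n_xh == 0:
            continue  # one side is empty
        if n_en >= max_words or n_xh >= max_words:
            continue  # too long for the model's context
        if max(n_en, n_xh) / min(n_en, n_xh) > max_ratio:
            continue  # lengths differ so much that the pair is probably misaligned
        kept.append((en, xh))
    return kept

# Toy input, invented for illustration
pairs = filter_parallel(
    ["The child is playing outside .", "", "Yes"],
    ["Umntwana udlala phandle .", "Ewe", "a b c d e f"],
)
print(pairs)
```

Only the first pair survives: the second has an empty English side and the third has a 1:6 length ratio.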


In [None]:
# Setting the dataset to follow Hugging Face format
from datasets import Dataset

def to_dataset(pairs):
    return Dataset.from_dict({
        "translation": [{"en": en, "xh": xh} for en, xh in pairs]
    })

train_ds = to_dataset(train)
val_ds = to_dataset(val)
test_ds = to_dataset(test)

### Loading Pre-trained Tokenizer and Model

In [None]:
from transformers import MarianTokenizer, MarianMTModel

model_name = "Helsinki-NLP/opus-mt-en-xh"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Tokenize the data
def preprocess(batch):
    # Extract English and isiXhosa lists from the batch
    english_texts = [item["en"] for item in batch["translation"]]
    xhosa_texts = [item["xh"] for item in batch["translation"]]

    # Tokenize sources and targets in one call; `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager and fills in "labels".
    # No padding here: DataCollatorForSeq2Seq pads each batch dynamically and
    # sets label padding to -100 so that it is ignored by the loss.
    return tokenizer(english_texts, text_target=xhosa_texts, truncation=True, max_length=128)

tokenized_train = train_ds.map(preprocess, batched=True)
tokenized_val = val_ds.map(preprocess, batched=True)

Map:   0%|          | 0/101356 [00:00<?, ? examples/s]



Map:   0%|          | 0/12669 [00:00<?, ? examples/s]
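
One detail worth checking whenever labels are padded to a fixed length: padding positions in the labels are normally replaced with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so the model is not trained to predict padding. A minimal sketch of that masking step, using a hypothetical helper name and made-up token ids:

```python
def mask_label_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in each label sequence with ignore_index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Made-up ids; 0 stands in for the pad id (in practice use tokenizer.pad_token_id)
padded = [[17, 42, 5, 0, 0], [8, 5, 0, 0, 0]]
masked = mask_label_padding(padded, pad_token_id=0)
print(masked)
```

`DataCollatorForSeq2Seq` applies the same idea automatically when it pads labels itself, which is why dynamic padding in the collator is usually the simpler option.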

In [None]:
# Training
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./en-xh-model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,
    report_to="wandb",
    logging_dir="./logs",
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
)



In [None]:
!pip install -r requirements.txt



In [None]:
!pip install wandb



In [None]:
%cd /content/drive/MyDrive/machine_translation_xhosa/language-translation
!git pull

/content/drive/MyDrive/machine_translation_xhosa/language-translation
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 378 bytes | 15.00 KiB/s, done.
From https://github.com/szinja/language-translation
   e1d8a9e..bb6b368  main       -> origin/main
Updating e1d8a9e..bb6b368
Fast-forward
 train.py | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)


In [None]:
print(train_ds[0])

{'translation': {'en': 'The methodology was developed in a 4 Country Project with South Africa , Uganda , Zimbabwe and Ghana , and has been refined in second phase in South Africa from 2003-5 , where it was tested and revised based on the experience of 8 pilots : Nkonkobe ( E Cape ) , Greater Tzaneen and BelaBela ( Limpopo ) , eThekwini and Msunduzi ( Kwazulu-Natal ) , Mangaung and Maluti-a-Phofung ( Free State ) and Mbombela ( Mpumalanga ) .', 'xh': 'Le ndlela yokusebenza yaye yaqulunqwa kwiProjekthi yaMazwe aMane , anoMzantsi Afrika , iYuganda , iZimbabwe neGhana , kwaye iye yasukulwa kwisigaba sesibini eMzantsi Afrika ukusukela ku-2003 ukuya ku-2005 , apho yathi yavavanywa khona yahlaziywa kusekelwe kumava eeprojekthi ezisibhozo ( 8 ) zovavanyo : iNkonkobe ( Mpuma Koloni ) , iGreater Tzaneen neBelaBela ( Limpopo ) , eThekwini naseMsunduzi ( KwaZulu-Natal ) , iMangaung neMaluti-a-Phofung ( Freyistati ) neMbombela ( Mpumalanga ) .'}}


In [None]:
!pip install sacremoses
!pip install -U transformers

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


In [None]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
%cd /content/drive/MyDrive/machine_translation_xhosa/language-translation
!python train.py --en_file english.txt --xh_file xhosa.txt --epochs 3

/content/drive/MyDrive/machine_translation_xhosa/language-translation
2025-05-17 18:33:41.699404: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747506821.721669   19250 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747506821.728340   19250 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
wandb: Currently logged in as: szinja (szinja-university-of-rochester) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.11
wandb: Run data is saved locally in /content/drive/MyDrive/machine_translation_xhosa/language-tr

In [48]:
%cd /content/drive/MyDrive/machine_translation_xhosa/language-translation
!git pull

/content/drive/MyDrive/machine_translation_xhosa/language-translation
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 21 (delta 14), reused 14 (delta 8), pack-reused 0 (from 0)
Unpacking objects: 100% (21/21), 3.45 KiB | 20.00 KiB/s, done.
From https://github.com/szinja/language-translation
   bb6b368..f4e8ae1  main       -> origin/main
Updating bb6b368..f4e8ae1
Fast-forward
 README.md        |  6 ++----
 requirements.txt |  3 ++-
 results.ipynb    | 37 -------------------------------------
 train.py         | 21 +++++++++++++++------
 4 files changed, 19 insertions(+), 48 deletions(-)
 delete mode 100644 results.ipynb


### Evaluating our model

In [52]:
# Quick sanity check: load the fine-tuned model and translate a single sentence
%cd /content/drive/MyDrive/machine_translation_xhosa/language-translation

from transformers import MarianTokenizer, MarianMTModel

model_path = "/content/en-xh-model"

tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path)

text = ["The child is playing outside."]
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

/content/drive/MyDrive/machine_translation_xhosa/language-translation
['Umntwana udlala phandle .']


In [53]:
# Loading 100 pairs from dataset
with open("english.txt", encoding="utf-8") as f:
    en_lines = [line.strip() for line in f.readlines()]

with open("xhosa.txt", encoding="utf-8") as f:
    xh_lines = [line.strip() for line in f.readlines()]

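# 100 consecutive pairs from later in the corpus (if training used a random split of the
# full corpus, some of these pairs may also have been seen during training)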
test_en = en_lines[100000:100100]
test_xh = xh_lines[100000:100100]

# Saving for evaluation
with open("test.en", "w", encoding="utf-8") as f:
    f.write("\n".join(test_en))

with open("test.xh", "w", encoding="utf-8") as f:
    f.write("\n".join(test_xh))

In [54]:
from transformers import MarianTokenizer, MarianMTModel

model_path = "/content/en-xh-model"
tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path)

# Load test set
with open("test.en", encoding="utf-8") as f:
    en_test = [line.strip() for line in f.readlines()]

# Translate
inputs = tokenizer(en_test, return_tensors="pt", padding=True, truncation=True, max_length=128)
translated = model.generate(**inputs, num_beams=5, max_length=128)
xh_pred = tokenizer.batch_decode(translated, skip_special_tokens=True)

# Clean predictions
xh_pred = [s.strip() for s in xh_pred]

In [55]:
import evaluate
bleu = evaluate.load("sacrebleu")

# Load gold xh references
with open("test.xh", encoding="utf-8") as f:
    xh_gold = [line.strip() for line in f.readlines()]

# sacrebleu expects List[List[str]]
refs = [[ref] for ref in xh_gold]
bleu_score = bleu.compute(predictions=xh_pred, references=refs)

print(f"BLEU on 100-test set: {bleu_score['score']:.2f}")

# Save for reference
with open("predictions.txt", "w", encoding="utf-8") as f:
    for src, ref, hyp in zip(en_test, xh_gold, xh_pred):
        f.write(f"EN: {src}\nREF: {ref}\nHYP: {hyp}\n\n")

BLEU on 100-test set: 13.68
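
For intuition about what the score measures, the sketch below computes a simplified sentence-level BLEU: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing, so it will not reproduce sacreBLEU's corpus-level numbers; `simple_bleu` is a hypothetical helper, and the example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU with one reference, whitespace tokens, no smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

reference = "umntwana udlala phandle namhlanje ."
exact = simple_bleu(reference, reference)
shorter = simple_bleu("umntwana udlala phandle namhlanje", reference)
print(f"{exact:.2f} {shorter:.2f}")
```

The second hypothesis matches every n-gram it contains but is one token shorter than the reference, so only the brevity penalty lowers its score.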


In [56]:
# Syncing with GitHub
%cd /content/drive/MyDrive/machine_translation_xhosa/language-translation
%ls

/content/drive/MyDrive/machine_translation_xhosa/language-translation
english.txt   predictions.txt   setup.sh          test.xh   xhosa.txt
en-xh-model/  README.md         streamlit_app.py  train.py
inference.py  requirements.txt  test.en           wandb/


In [98]:
!rm -f /content/drive/MyDrive/machine_translation_xhosa/language-translation/.git/index.lock
!git add english.txt xhosa.txt test.en test.xh predictions.txt

In [64]:
!git config --global user.email "szinja@u.rochester.edu"
!git config --global user.name "szinja"

In [96]:
!git add .gitignore
!git commit -m "Ignore large checkpoint files"

[main 0df861b] Ignore large checkpoint files
 6 files changed, 254025 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 english.txt
 create mode 100644 predictions.txt
 create mode 100644 test.en
 create mode 100644 test.xh
 create mode 100644 xhosa.txt


In [102]:
!git commit -m "Add cleaned model config/tokenizer files (no large checkpoints)"

[main 6052e6f] Add cleaned model config/tokenizer files (no large checkpoints)
 8 files changed, 61402 insertions(+)
 create mode 100644 en-xh-model/config.json
 create mode 100644 en-xh-model/generation_config.json
 create mode 100644 en-xh-model/source.spm
 create mode 100644 en-xh-model/special_tokens_map.json
 create mode 100644 en-xh-model/target.spm
 create mode 100644 en-xh-model/tokenizer_config.json
 create mode 100644 en-xh-model/training_args.bin
 create mode 100644 en-xh-model/vocab.json


In [104]:
!git push

Enumerating objects: 20, done.
Counting objects: 100% (20/20), done.
Delta compression using up to 12 threads
Compressing objects: 100% (17/17), done.
Writing objects: 100% (19/19), 10.59 MiB | 14.01 MiB/s, done.
Total 19 (delta 3), reused 13 (delta 2), pack-reused 0
remote: Resolving deltas: 100% (3/3), done.
To https://github.com/szinja/language-translation.git
   f4e8ae1..6052e6f 

### Hyperparameter Tuning

In [None]:
# Applying optimization techniques: dropout tuning, learning rate, epoch selection with early stop
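
As a starting point for the epoch-selection part, here is a minimal sketch of patience-based early stopping in plain Python. The helper name and the validation losses are made up for illustration. With the Hugging Face `Trainer`, the same behaviour is typically obtained with `EarlyStoppingCallback(early_stopping_patience=...)` together with `load_best_model_at_end=True`; treat that as a pointer to verify against the installed version rather than a tested configuration.

```python
def select_best_epoch(eval_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, best_loss) given per-epoch validation losses.

    Stops once the loss has failed to improve by more than min_delta
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses):
        if best_loss - loss > min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; later epochs are never evaluated
    return best_epoch, best_loss

# Made-up validation losses, for illustration only
losses = [2.10, 1.85, 1.80, 1.82, 1.81, 1.79]
result = select_best_epoch(losses, patience=2)
print(result)
```

With `patience=2`, training stops after epoch 4 and the checkpoint from epoch 2 is kept, even though epoch 5 would have been slightly better. A larger patience trades compute for a lower risk of stopping too early.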
