# **Sinhala-Tamil Neural Machine Translation Using Pivot-Based Transfer Learning**
This Colab notebook demonstrates a proof of concept for a neural machine translation system between Sinhala and Tamil, using English as a pivot language. I use the mBART-50 model and fine-tune it with LoRA adapters (via the PEFT library), a parameter-efficient approach suited to low-resource language pairs.


In [3]:
# Install Hugging Face Transformers and other required libraries
!pip install transformers torch sentencepiece sacrebleu adapter-transformers peft datasets scikit-learn

Collecting transformers
  Using cached transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Using cached transformers-4.47.1-py3-none-any.whl (10.1 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.49.0
    Uninstalling transformers-4.49.0:
      Successfully uninstalled transformers-4.49.0
Successfully installed transformers-4.47.1


In [4]:
!pip install --upgrade peft transformers

Collecting transformers
  Using cached transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Using cached transformers-4.49.0-py3-none-any.whl (10.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.47.1
    Uninstalling transformers-4.47.1:
      Successfully uninstalled transformers-4.47.1
Successfully installed transformers-4.49.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
adapters 1.1.0 requires transformers~=4.47.1, but you have transformers 4.49.0 which is incompatible.


In [5]:
import pandas as pd
from transformers import MBartForConditionalGeneration, MBartTokenizer, Trainer, TrainingArguments
from adapters import AdapterConfig
from sklearn.model_selection import train_test_split
from sacrebleu import corpus_bleu
from peft import PeftModel, LoraConfig




## **Data Loading and Initial Inspection**
Load the datasets, inspect their structure, and prepare for cleaning.


In [6]:
# from google.colab import drive
# drive.mount('/content/drive')


In [7]:
# Load initial datasets (assumed to be named "sinhala_english_subset.csv" for Sinhala-English and "tamil_english_subset.csv" for English-Tamil)
en_si_df = pd.read_csv('sinhala_english_subset.csv')
en_ta_df = pd.read_csv('tamil_english_subset.csv')

# Display the first few rows of each dataset
en_si_df.head(), en_ta_df.head()


(                                              source  \
 0  කම්හල් මිලිග්රෑම් වානේ ෆයිබර් cnc 1000W ලේසර් ...   
 1                                       Johnie පවසයි   
 2  (vi) 2015 වර්ෂය සඳහා ශ්‍රී ලංකා ප්‍රතිපත්ති අධ...   
 3  එවිට රජ්ජුරුවෝ උත්තරදෙමින්: සැබවක් නුඹලාට කියම...   
 4    සේවාව සඳහා ලියාපදිංචි වීම හා අනාවැකි ලබා ගැනීම.   
 
                                               target  
 0  Milligram Steel Fiber Cnc 1000W Laser Cutting ...  
 1                                        Johnie says  
 2  (vi) Annual Report of the Institute of Policy ...  
 3  Then he will prove to be king, but you must ha...  
 4       Registration and Predictions for the Service  ,
                                               source  \
 0  Factory Price Steel Fiber CNC 1000W Laser Cutt...   
 1                                        Johnie says   
 2  (VI) Annual Report of the Sri Lanka Policy Res...   
 3  The king replied, "I really say to you," I hav...   
 4     Service registration and re

## **Data Cleaning**
This step involves removing duplicates, handling missing values, and normalizing text.
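As a minimal sketch of the normalization step, the helper below is a slightly more defensive variant of the notebook's `normalize_text`. The Unicode NFC step, the whitespace collapsing, and the handling of non-string values are assumptions added for illustration; they are not part of the original pipeline. Lowercasing only changes the English side, since Sinhala and Tamil scripts have no letter case.

```python
import unicodedata

def normalize_text(text):
    """Normalize one cell of a parallel corpus (illustrative variant)."""
    # Non-string values (e.g. NaN left over from a CSV) become empty strings
    if not isinstance(text, str):
        return ""
    # NFC keeps visually identical strings byte-identical
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace; zero-width joiners are not whitespace,
    # so Sinhala conjuncts that rely on them are left untouched
    text = " ".join(text.split())
    return text.lower()

print(normalize_text("  Johnie   SAYS \n"))  # johnie says
```

Rows that end up empty after this step would still need to be dropped before training.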


In [8]:
# Remove any rows with null values
en_si_df.dropna(subset=['source', 'target'], inplace=True)
en_ta_df.dropna(subset=['source', 'target'], inplace=True)

# Remove duplicates
en_si_df.drop_duplicates(inplace=True)
en_ta_df.drop_duplicates(inplace=True)

# Normalize text (optional: lowercasing and trimming whitespace)
def normalize_text(text):
    return text.lower().strip()

en_si_df['source'] = en_si_df['source'].apply(normalize_text)
en_si_df['target'] = en_si_df['target'].apply(normalize_text)
en_ta_df['source'] = en_ta_df['source'].apply(normalize_text)
en_ta_df['target'] = en_ta_df['target'].apply(normalize_text)

# Verify cleaning results
en_si_df.head(), en_ta_df.head()


(                                              source  \
 0  කම්හල් මිලිග්රෑම් වානේ ෆයිබර් cnc 1000w ලේසර් ...   
 1                                       johnie පවසයි   
 2  (vi) 2015 වර්ෂය සඳහා ශ්‍රී ලංකා ප්‍රතිපත්ති අධ...   
 3  එවිට රජ්ජුරුවෝ උත්තරදෙමින්: සැබවක් නුඹලාට කියම...   
 4    සේවාව සඳහා ලියාපදිංචි වීම හා අනාවැකි ලබා ගැනීම.   
 
                                               target  
 0  milligram steel fiber cnc 1000w laser cutting ...  
 1                                        johnie says  
 2  (vi) annual report of the institute of policy ...  
 3  then he will prove to be king, but you must ha...  
 4       registration and predictions for the service  ,
                                               source  \
 0  factory price steel fiber cnc 1000w laser cutt...   
 1                                        johnie says   
 2  (vi) annual report of the sri lanka policy res...   
 3  the king replied, "i really say to you," i hav...   
 4     service registration and re

## **Add Language Tokens**
mBART requires specifying source and target language tokens. For this project, we'll add `en_XX` for English, `si_LK` for Sinhala, and `ta_IN` for Tamil.


In [9]:
def add_lang_tokens(row, src_lang, tgt_lang):
    row["source"] = f"{src_lang} {row['source']}"
    row["target"] = f"{tgt_lang} {row['target']}"
    return row

# Add language tokens
en_si_df = en_si_df.apply(add_lang_tokens, src_lang="si_LK", tgt_lang="en_XX", axis=1)
en_ta_df = en_ta_df.apply(add_lang_tokens, src_lang="en_XX", tgt_lang="ta_IN", axis=1)

# Save cleaned data for future use if needed
en_si_df.to_csv("cleaned_en_si.csv", index=False)
en_ta_df.to_csv("cleaned_en_ta.csv", index=False)
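The prefixes added here are removed again during evaluation with `str.replace`, which deletes every occurrence of the tag, not only the leading one. A minimal sketch of a prefix-only round trip follows; the helper names `add_lang_token` and `strip_lang_token` are hypothetical and only meant to illustrate the idea.

```python
# Language codes used in this notebook
LANG_CODES = ("en_XX", "si_LK", "ta_IN")

def add_lang_token(text, lang):
    """Prefix a sentence with its language code, as add_lang_tokens does per row."""
    return f"{lang} {text}"

def strip_lang_token(text):
    """Remove a single leading language code, leaving later occurrences intact."""
    for code in LANG_CODES:
        prefix = code + " "
        if text.startswith(prefix):
            return text[len(prefix):]
    return text

tagged = add_lang_token("hello", "en_XX")
print(tagged)                    # en_XX hello
print(strip_lang_token(tagged))  # hello
```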


## **Data Splitting**
Split each dataset into training and validation sets.


In [10]:
train_en_si, val_en_si = train_test_split(en_si_df, test_size=0.1, random_state=42)
train_en_ta, val_en_ta = train_test_split(en_ta_df, test_size=0.1, random_state=42)
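To make the effect of `test_size=0.1` concrete, the sketch below reproduces the split sizes by hand. It assumes scikit-learn rounds the validation share up and gives the remainder to training. The total row counts (13459 and 13418) are inferred from the tokenization logs later in the notebook (12113 + 1346 and 12076 + 1342).

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_val) for a fractional test_size."""
    n_val = math.ceil(n_rows * test_size)  # validation share is rounded up
    return n_rows - n_val, n_val

print(split_sizes(13459, 0.1))  # (12113, 1346)  Sinhala-English
print(split_sizes(13418, 0.1))  # (12076, 1342)  English-Tamil
```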


## **Model and Tokenizer Setup**
Load mBART model for multilingual translation, which will be fine-tuned on our data.


In [11]:
from transformers import MBartForConditionalGeneration, MBart50Tokenizer

# Load the model and correct tokenizer
model_name = 'facebook/mbart-large-50-many-to-many-mmt'
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50Tokenizer.from_pretrained(model_name)

## **Adapter Configuration and Training**
Adapters allow parameter-efficient fine-tuning, which is ideal for low-resource languages.


In [12]:
# Define the adapter configuration with target modules
adapter_config = LoraConfig(
    task_type="SEQ_2_SEQ_LM",    # Task type for sequence-to-sequence language modeling
    r=16,                         # LoRA rank
    lora_alpha=32,               # Scaling factor for LoRA
    lora_dropout=0.1,            # Dropout rate for LoRA
    target_modules=["q_proj", "v_proj"]  # Target modules for LoRA layers in the transformer model
)

# Wrap the model with PeftModel for adapter functionality
peft_model = PeftModel(model, adapter_config)

# Add adapters for Sinhala-English and English-Tamil with the config passed directly
peft_model.add_adapter("sinhala_english", peft_config=adapter_config)
peft_model.add_adapter("english_tamil", peft_config=adapter_config)
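To give a feel for how small each adapter is, here is a rough parameter count for the configuration above. It is a back-of-the-envelope sketch that assumes the usual mBART-large dimensions (hidden size 1024, 12 encoder and 12 decoder layers) and that `q_proj` and `v_proj` are matched in every attention block, including decoder cross-attention. None of these figures are read from the loaded model.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to each target weight."""
    return r * (d_in + d_out)

# Assumed mBART-large dimensions (not read from the checkpoint)
D_MODEL = 1024
ENC_LAYERS = 12
DEC_LAYERS = 12
R = 16       # LoRA rank from the config above
ALPHA = 32   # lora_alpha from the config above

# Encoder self-attention + decoder self-attention + decoder cross-attention
attention_blocks = ENC_LAYERS + 2 * DEC_LAYERS
target_modules = attention_blocks * 2  # q_proj and v_proj in each block
per_module = lora_params(D_MODEL, D_MODEL, R)
total = target_modules * per_module

print(per_module)  # 32768
print(total)       # 2359296
print(ALPHA / R)   # 2.0, the scaling applied to the LoRA update
```

Under these assumptions each adapter has roughly 2.4 million trainable parameters, a small fraction of the base model.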

## **Training Setup**
Set up training arguments, including batch size, learning rate, and number of epochs.


In [13]:
!pip install wandb



In [14]:
# First initialize wandb
import wandb
wandb.init(project="training-2", name="training-2")

# Then set up training arguments
training_args = TrainingArguments(
    output_dir='./training-2',
    evaluation_strategy="epoch",

    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="wandb",           # Enable wandb logging
    run_name="training-1",       # Name of your specific run
    logging_dir='./logs',        # Directory for storing logs
    logging_steps=100,
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mlprajika[0m ([33mlprajika-test[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin




## **Model Training**
Use Hugging Face's Trainer API to train the model on the pivot language datasets.


In [15]:
from datasets import Dataset

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_en_si)
val_dataset = Dataset.from_pandas(val_en_si)

# Rename columns to be compatible with the Trainer (optional, depending on your tokenizer function)
train_dataset = train_dataset.rename_column("source", "input_text")
train_dataset = train_dataset.rename_column("target", "label")
val_dataset = val_dataset.rename_column("source", "input_text")
val_dataset = val_dataset.rename_column("target", "label")



## **Phase Training Approach**
Phase training involves two distinct stages:
1. **Stage 1**: Fine-tune the model on Sinhala-English pairs to capture Sinhala to English translation patterns.
2. **Stage 2**: Fine-tune on English-Tamil pairs to complete the pivot-based translation from Sinhala to Tamil.

This approach allows the model to adapt to each translation task sequentially, potentially improving the final Sinhala-Tamil translation quality by reducing error accumulation.
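At inference time the two stages are chained: the English output of the first step becomes the input of the second. The sketch below shows only that composition, with toy lookup tables standing in for the two fine-tuned models. The function name `make_pivot_translator` and the tables are hypothetical; the sentence pair is the one used in the translation test at the end of this notebook.

```python
def make_pivot_translator(src_to_pivot, pivot_to_tgt):
    """Compose two translation functions through a pivot language."""
    def translate(text):
        pivot = src_to_pivot(text)
        # The second step only ever sees the pivot text, so any error
        # made in the first step is passed on unchanged
        return pivot_to_tgt(pivot), pivot
    return translate

# Toy stand-ins for the Sinhala-English and English-Tamil models
SI_EN = {"ඔයාගේ නම මොකද්ද?": "What's your name?"}
EN_TA = {"What's your name?": "உன் பெயர் என்ன?"}

si_to_ta = make_pivot_translator(
    lambda s: SI_EN.get(s, ""),
    lambda s: EN_TA.get(s, ""),
)

tamil, english = si_to_ta("ඔයාගේ නම මොකද්ද?")
print(english)  # What's your name?
print(tamil)    # உன் பெயர் என்ன?
```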


In [16]:
# Function to tokenize Sinhala-English data
def tokenize_function_si(examples):
    tokenizer.src_lang = "si_LK"
    tokenizer.tgt_lang = "en_XX"
    # text_target tokenizes the labels in target-language mode
    # (replaces the deprecated as_target_tokenizer() context manager)
    model_inputs = tokenizer(
        examples["input_text"],
        text_target=examples["label"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )
    return model_inputs

# Apply the tokenization to Sinhala-English dataset
train_dataset_si = train_dataset.map(tokenize_function_si, batched=True)
val_dataset_si = val_dataset.map(tokenize_function_si, batched=True)

# Remove unnecessary columns
train_dataset_si = train_dataset_si.remove_columns(["input_text", "label", "__index_level_0__"])
val_dataset_si = val_dataset_si.remove_columns(["input_text", "label", "__index_level_0__"])

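# Activate the Sinhala-English adapter so that Stage 1 updates its weights
peft_model.set_adapter("sinhala_english")

# Set up the Trainer for Sinhala-English fine-tuning
trainer_si = Trainer(
    model=peft_model,  # PEFT model with the LoRA adapters attached
    args=training_args,
    train_dataset=train_dataset_si,
    eval_dataset=val_dataset_si
)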

# Run training on Sinhala-English pairs
print("Starting Stage 1: Sinhala-English Training")
trainer_si.train()


Map: 100%|██████████| 12113/12113 [00:04<00:00, 2462.09 examples/s]
Map: 100%|██████████| 1346/1346 [00:00<00:00, 2217.73 examples/s]
No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting Stage 1: Sinhala-English Training


Epoch,Training Loss,Validation Loss
1,8.9666,No log
2,8.9077,No log
3,8.9899,No log




TrainOutput(global_step=9087, training_loss=9.055829187214565, metrics={'train_runtime': 97829.9978, 'train_samples_per_second': 0.371, 'train_steps_per_second': 0.093, 'total_flos': 1.0041448269348864e+16, 'train_loss': 9.055829187214565, 'epoch': 3.0})
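The `global_step` and throughput figures in the `TrainOutput` above can be checked by hand. This is a small sketch that assumes a single device and no gradient accumulation, so one optimizer step corresponds to one batch of 4 examples.

```python
import math

def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps when every batch triggers one update."""
    return math.ceil(n_examples / batch_size) * epochs

stage1_steps = total_steps(12113, 4, 3)
stage2_steps = total_steps(12076, 4, 3)
print(stage1_steps)  # 9087, matches global_step for Stage 1
print(stage2_steps)  # 9057, matches global_step for Stage 2

runtime_s = 97829.9978  # train_runtime reported for Stage 1
print(round(stage1_steps / runtime_s, 3))  # 0.093 steps per second
print(round(runtime_s / 3600, 1))          # 27.2 hours
```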

In [30]:
print("Training complete for Sinhala-English")
print("Saving Sinhala-English adapter...")
peft_model.set_adapter("sinhala_english")
peft_model.save_pretrained("./models/sinhala_english")

Training complete for Sinhala-English
Saving Sinhala-English adapter...


In [17]:
training_args = TrainingArguments(
    output_dir='./training-2',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="wandb",           # Enable wandb logging
    run_name="training-1",       # Name of your specific run
    logging_dir='./logs',        # Directory for storing logs
    logging_steps=100,
)



In [18]:
from datasets import Dataset

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_en_ta)
val_dataset = Dataset.from_pandas(val_en_ta)

# Rename columns to be compatible with the Trainer (optional, depending on your tokenizer function)
train_dataset = train_dataset.rename_column("source", "input_text")
train_dataset = train_dataset.rename_column("target", "label")
val_dataset = val_dataset.rename_column("source", "input_text")
val_dataset = val_dataset.rename_column("target", "label")

In [19]:
# Function to tokenize English-Tamil data
def tokenize_function_ta(examples):
    tokenizer.src_lang = "en_XX"
    tokenizer.tgt_lang = "ta_IN"
    # text_target tokenizes the labels in target-language mode
    # (replaces the deprecated as_target_tokenizer() context manager)
    model_inputs = tokenizer(
        examples["input_text"],
        text_target=examples["label"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )
    return model_inputs

# Apply the tokenization to English-Tamil dataset
train_dataset_ta = train_dataset.map(tokenize_function_ta, batched=True)
val_dataset_ta = val_dataset.map(tokenize_function_ta, batched=True)

# Remove unnecessary columns
train_dataset_ta = train_dataset_ta.remove_columns(["input_text", "label", "__index_level_0__"])
val_dataset_ta = val_dataset_ta.remove_columns(["input_text", "label", "__index_level_0__"])

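# Activate the English-Tamil adapter so that Stage 2 updates its weights
peft_model.set_adapter("english_tamil")

# Set up the Trainer for English-Tamil fine-tuning
trainer_ta = Trainer(
    model=peft_model,  # Continue with the same PEFT model
    args=training_args,
    train_dataset=train_dataset_ta,
    eval_dataset=val_dataset_ta
)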

# Run training on English-Tamil pairs
print("Starting Stage 2: English-Tamil Training")
trainer_ta.train()


Map: 100%|██████████| 12076/12076 [00:04<00:00, 2658.40 examples/s]
Map: 100%|██████████| 1342/1342 [00:00<00:00, 2702.12 examples/s]
No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting Stage 2: English-Tamil Training


Epoch,Training Loss,Validation Loss
1,8.6412,No log
2,8.4173,No log
3,8.5155,No log


TrainOutput(global_step=9057, training_loss=8.531871390050513, metrics={'train_runtime': 97028.9403, 'train_samples_per_second': 0.373, 'train_steps_per_second': 0.093, 'total_flos': 1.0010775968022528e+16, 'train_loss': 8.531871390050513, 'epoch': 3.0})

In [31]:
print("Training complete for English-Tamil")
print("Saving English-Tamil adapter...")
peft_model.set_adapter("english_tamil")
peft_model.save_pretrained("./models/english_tamil")

Training complete for English-Tamil
Saving English-Tamil adapter...


In [20]:
import torch
from sacrebleu import corpus_bleu

# Set device to 'cuda' if a GPU is available, otherwise use 'cpu'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

test_english = "Good morning"
inputs = tokenizer(test_english, return_tensors="pt", padding=True).to(device)
peft_model.set_adapter("english_tamil")
forced_bos_token_id = tokenizer.lang_code_to_id["ta_IN"]
output = peft_model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,  # Ensure Tamil output
    num_beams=5,  # Use beam search with 5 beams
    max_length=50,  # Ensure longer outputs if needed
    early_stopping=True  # Stop when the best translation is found
)
print(tokenizer.decode(output[0], skip_special_tokens=True))  # Should output "காலை வணக்கம்"

நன்னாள்


## **Evaluation**
After completing both stages, I evaluate the model's performance on the Sinhala-Tamil validation dataset by calculating BLEU scores for translations produced through the English pivot.


In [21]:
import torch
from sacrebleu import corpus_bleu

def calculate_bleu_score_pivot(val_dataset_si_ta, si_source_col="source", ta_target_col="target"):
    """
    Calculate BLEU score for Sinhala → Tamil translation via English pivot.

    Args:
        val_dataset_si_ta (pd.DataFrame): A pandas DataFrame with columns:
            - "source": Sinhala text
            - "target": Reference Tamil translation

    Returns:
        None. Prints BLEU score and translation samples for debugging.
    """
    predictions = []
    references = []

    peft_model.to(device)

    for index, row in val_dataset_si_ta.iterrows():
        sinhala_text = row[si_source_col].replace("si_LK ", "")
        tamil_reference = row[ta_target_col].replace("ta_IN ", "")

        # ------------------- STEP 1: Sinhala → English -------------------
        peft_model.set_adapter("sinhala_english")
        tokenizer.src_lang = "si_LK"
        tokenizer.tgt_lang = "en_XX"

        inputs = tokenizer(sinhala_text, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
        english_output = peft_model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],  # Force English as the pivot language
            num_beams=5,  # Use beam search for better accuracy
            max_length=50,  # Caps the output at 50 tokens; longer pivots are truncated
            early_stopping=True
        )

        english_text = tokenizer.decode(english_output[0], skip_special_tokens=True)
        print(f"[{index}] English Pivot: {english_text}")  # Debug: Print English pivot output

        # ------------------- STEP 2: English → Tamil -------------------
        peft_model.set_adapter("english_tamil")
        tokenizer.src_lang = "en_XX"
        tokenizer.tgt_lang = "ta_IN"

        # Ensure Tamil output by forcing the Tamil language token
        forced_bos_token_id = tokenizer.lang_code_to_id["ta_IN"]

        inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
        tamil_output = peft_model.generate(
            **inputs,
            forced_bos_token_id=forced_bos_token_id,  # Force Tamil as output
            num_beams=5,  # Beam search for better results
            max_length=50,  # Caps the output at 50 tokens; longer translations are truncated
            early_stopping=True
        )

        tamil_prediction = tokenizer.decode(tamil_output[0], skip_special_tokens=True)
        print(f"[{index}] Predicted Tamil: {tamil_prediction}")  # Debug: Print Tamil output
        print(f"[{index}] Reference Tamil: {tamil_reference}")  # Debug: Print reference Tamil

        predictions.append(tamil_prediction)
        references.append(tamil_reference)

    # Compute BLEU score; sacrebleu expects a list of reference streams,
    # each holding one reference per prediction
    bleu_score = corpus_bleu(predictions, [references])
    print(f"\n📊 BLEU Score: {bleu_score.score:.3f}")

# Load validation dataset
val_dataset_si_ta = pd.read_csv("sinhala_tamil_val.csv")

# Calculate BLEU score
calculate_bleu_score_pivot(val_dataset_si_ta)
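One detail worth spelling out is the reference layout. `sacrebleu.corpus_bleu` takes the hypotheses as a flat list and the references as a list of streams, where each stream holds one reference per hypothesis. The helper below is a hypothetical sketch that converts per-sentence reference lists into that layout, using placeholder strings.

```python
def to_reference_streams(per_sentence_refs):
    """Transpose [[ref_a, ref_b], ...] (one list per sentence) into
    [[ref_a for every sentence], [ref_b for every sentence]]."""
    if not per_sentence_refs:
        return []
    n_refs = len(per_sentence_refs[0])
    if any(len(refs) != n_refs for refs in per_sentence_refs):
        raise ValueError("every sentence needs the same number of references")
    return [[refs[k] for refs in per_sentence_refs] for k in range(n_refs)]

# One reference per sentence, as in this notebook
print(to_reference_streams([["ref 1"], ["ref 2"], ["ref 3"]]))
# [['ref 1', 'ref 2', 'ref 3']]
```

With a single reference per sentence, the result is one stream whose length equals the number of predictions.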


[0] English Pivot: At that time, at the Ministry of Education, the Seisakugaku Kyojo began in 1874 with a proposal by Mr. Gottfried Wagener, a German-Ugrian scientist. Wagener became aware of the need for




[0] Predicted Tamil: அந்த நேரத்தில், கல்வி அமைச்சகத்தில், ஜேர்மனிய-யூக்ரிய விஞ்ஞானி திரு. கோட்ஃபிரீட் வாகேனர் முன்மொழியினால் 1874 ஆம் ஆண்டில் Seisakugaku Kyojo தொடங்க
[0] Reference Tamil: ஏறக்குறைய அதே சமயத்தில், கல்வி அமைச்சின் அதிகாரிகளை Seisakugaku Kyojo நிறுவப்பட்டது 1874 கோட்ஃபிரெய்ட் Wagener அவர்களின் ஆலோசனையின் பேரில், ஒரு ஜெர்மனில் பிறந்த விஞ்ஞானி. Wagener மூத்த பொறியாளர்கள் மற்றும் பொறியாளர்கள் பயிரிட பொருட்டு ஜப்பான் உள்ள நடைமுறை தொழில்நுட்ப கல்வி தேவையை பற்றி குரல் இருந்தது. Seisakugaku Kyojo மூன்று ஆண்டுகளுக்கு பின்னர் மூடப்பட்டது என்றாலும், அது மாணவர்கள் ஒரு புரட்சிகர பள்ளி நடைமுறை திறன்கள் ஜப்பனீஸ் தொழில் நவீனமயமானது தேவையான பொறியாளர்கள் தயாரிக்க அறிவியல் கோட்பாடுகள் இணைந்து கற்று செய்விக்கப்பட்டது.
[1] English Pivot: Modifying, replacing, and servicing farm systems and when equipment fails professionals are aware.
[1] Predicted Tamil: வேளாண் அமைப்புகளை மாற்றுதல், மாற்றுதல், பராமரித்தல் மற்றும் சாதனங்கள் தோல்வியடையும்போது தொழில்நுட்ப வல்லுனர்கள் அறிந்துள்ளனர்.
[1

## **Translation Testing**
Test a sample translation to observe the model’s performance.


In [25]:
def translate_text(input_text):
    print(f"Input (Sinhala): {input_text}")

    # ------------------- STEP 1: Sinhala → English -------------------
    peft_model.set_adapter("sinhala_english")
    print(f"Active Adapter (Sinhala → English): {peft_model.active_adapter}")

    tokenizer.src_lang = "si_LK"
    tokenizer.tgt_lang = "en_XX"
    print(f"Tokenizer src_lang: {tokenizer.src_lang}, tgt_lang: {tokenizer.tgt_lang}")

    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    output = peft_model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]  # Forces the pivot output to English
    )
    english_text = tokenizer.decode(output[0], skip_special_tokens=True)

    print(f"English Pivot Output: {english_text}")

    # ------------------- STEP 2: English → Tamil (Fixed) -------------------
    peft_model.set_adapter("english_tamil")
    print(f"Active Adapter (English → Tamil): {peft_model.active_adapter}")

    tokenizer.src_lang = "en_XX"
    tokenizer.tgt_lang = "ta_IN"
    print(f"Tokenizer src_lang: {tokenizer.src_lang}, tgt_lang: {tokenizer.tgt_lang}")

    # 🛠 Fix 1: Tokenize the English pivot text with the tokenizer's src_lang set to en_XX
    inputs = tokenizer(english_text, return_tensors="pt").to(device)

    # 🛠 Fix 2: Explicitly set the forced decoder token to Tamil
    forced_bos_token_id = tokenizer.lang_code_to_id["ta_IN"]  # Ensure Tamil output

    output = peft_model.generate(
        **inputs,
        forced_bos_token_id=forced_bos_token_id  # Forces output language to Tamil
    )

    print(f"Raw Output Tokens (English → Tamil): {output}")

    tamil_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Translated Tamil Output: {tamil_text}")

    return tamil_text

# Test again
sample_text = "ඔයාගේ නම මොකද්ද?"  # Sinhala for "What's your name?"
translated_text = translate_text(sample_text)
print(f"Final Translation (Sinhala → Tamil): {translated_text}")


Input (Sinhala): ඔයාගේ නම මොකද්ද?
Active Adapter (Sinhala → English): sinhala_english
Tokenizer src_lang: si_LK, tgt_lang: en_XX
English Pivot Output: What's your name?
Active Adapter (English → Tamil): english_tamil
Tokenizer src_lang: en_XX, tgt_lang: ta_IN
Raw Output Tokens (English → Tamil): tensor([[     2, 250044,  86136,  55241,   9784,     32,      2]])
Translated Tamil Output: உன் பெயர் என்ன?
Final Translation (Sinhala → Tamil): உன் பெயர் என்ன?


## **Conclusion**
This notebook provides a proof of concept for translating from Sinhala to Tamil using English as a pivot language, using mBART with LoRA adapters for parameter-efficient training. Future steps could involve refining the model with additional data and adjusting adapter configurations to improve translation accuracy.
