# 🌍 English to Punjabi Translation Model Training

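This notebook contains the complete pipeline for fine-tuning a pretrained transformer model for English-to-Punjabi translation: data loading, preprocessing, evaluation metrics, training setup, and testing of the saved model.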

**Project:** NLP Project - Annual Report Summarizer  
**Task:** Multilingual Integration (Punjabi)

### 1. Install Dependencies

In [None]:
#!pip install transformers[torch] datasets sacrebleu sentencepiece evaluate rouge_score

### 2. Load Dataset

In [18]:
from datasets import load_dataset

try:
    print("🔍 Loading OPUS-100 for English-Punjabi...")
    raw_datasets = load_dataset("opus100", "en-pa")
    print("✅ Successfully loaded OPUS-100!")
except Exception as e:
    print(f"❌ Failed to load dataset: {e}")
    # Chain the original exception so the root cause stays in the traceback
    raise RuntimeError("Could not load the OPUS-100 English-Punjabi dataset.") from e

print(raw_datasets)

🔍 Loading OPUS-100 for English-Punjabi...
✅ Successfully loaded OPUS-100!
DatasetDict({
    test: Dataset({
        features: ['translation'],
        num_rows: 2000
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 107296
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2000
    })
})


### 3. Initialize Model and Tokenizer
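For Indic languages, Helsinki-NLP provides a multilingual group model, `Helsinki-NLP/opus-mt-en-inc` (`inc` = Indic). Because one checkpoint serves several target languages, each input sentence must start with a target-language token of the form `>>id<<`; section 4 prepends `>>pan<< ` for Punjabi.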

In [19]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_checkpoint = "Helsinki-NLP/opus-mt-en-inc"

print(f"🤖 Loading model and tokenizer: {model_checkpoint}")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
print("✅ Model loaded successfully!")

🤖 Loading model and tokenizer: Helsinki-NLP/opus-mt-en-inc
✅ Model loaded successfully!


### 4. Preprocessing

In [20]:
max_input_length = 128
max_target_length = 128
target_token = ">>pan<< " 

def preprocess_function(examples):
    if "translation" in examples:
        inputs = [target_token + ex.get("en", "") for ex in examples["translation"]]
        targets = [ex.get("pa", "") for ex in examples["translation"]]
    else:
        inputs = [target_token + text for text in examples.get("en", [])]
        targets = examples.get("pa", [])
        
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

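    # `text_target` tokenizes the Punjabi side with the target vocabulary;
    # it replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)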

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
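
To make the batched input format concrete, here is a minimal, tokenizer-free sketch of the pairing step above. With `batched=True`, `map` passes a dict of columns, so `examples["translation"]` is a list of `{"en": ..., "pa": ...}` dicts. The helper name `build_pairs` and the toy batch are illustrative assumptions, not part of the notebook.

```python
# Illustrative only: `build_pairs` and `toy_batch` are hypothetical names.
TARGET_TOKEN = ">>pan<< "  # same prefix as `target_token` above

def build_pairs(batch, target_token=TARGET_TOKEN):
    """Return (inputs, targets) lists from a batched `translation` column."""
    inputs = [target_token + ex.get("en", "") for ex in batch["translation"]]
    targets = [ex.get("pa", "") for ex in batch["translation"]]
    return inputs, targets

toy_batch = {
    "translation": [
        {"en": "Name:", "pa": "ਨਾਂ:"},
        {"en": "Published", "pa": "ਪਬਲਿਸ਼ ਕੀਤੇ"},
    ]
}

inputs, targets = build_pairs(toy_batch)
print(inputs)   # English sources, each prefixed with the language token
print(targets)  # Punjabi references, unchanged
```

Only the source side receives the `>>pan<< ` prefix; the Punjabi targets are passed through untouched and become the labels.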


### 5. Evaluation Metrics
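To measure translation quality, we use **BLEU** and **chrF**, both computed with sacreBLEU through the `evaluate` library. Note that `evaluate.load("chrf")` computes plain chrF by default; chrF++ additionally requires `word_order=2` in `compute()`.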
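
For intuition about what a character n-gram F-score measures, the following is a simplified, single-sentence sketch in pure Python. It is not sacreBLEU's implementation and its numbers will not match the `chrf` metric exactly; the function names and the averaging scheme are assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta=2 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(simple_chrf("abcd", "abce"))
```

Because it works on characters rather than words, a score of this kind gives partial credit for near-miss word forms, which is useful for a morphologically rich language such as Punjabi.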

In [21]:
import evaluate
import numpy as np

metric = evaluate.load("sacrebleu")  # corpus-level BLEU
chrf_metric = evaluate.load("chrf")  # plain chrF by default; compute(..., word_order=2) gives chrF++

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels since we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    chrf = chrf_metric.compute(predictions=decoded_preds, references=decoded_labels)
    
    return {
        "bleu": result["score"], 
        "chrf": chrf["score"],
        "gen_len": np.mean([np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds])
    }
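
`DataCollatorForSeq2Seq` pads label sequences with -100 by default so that padded positions are ignored by the loss. Those values are not real token IDs, which is why `compute_metrics` swaps them for the pad token before decoding. The sketch below shows that step, and the reference wrapping done by `postprocess_text`, on plain Python lists. `PAD_ID = 0`, the token IDs, and the helper names are made up for illustration.

```python
IGNORE_INDEX = -100  # label padding value that the loss skips
PAD_ID = 0           # hypothetical pad token ID, for illustration only

def unmask_labels(label_rows, pad_id=PAD_ID):
    """Replace IGNORE_INDEX with pad_id so every ID can be decoded."""
    return [[tok if tok != IGNORE_INDEX else pad_id for tok in row]
            for row in label_rows]

def wrap_references(decoded_labels):
    """sacreBLEU accepts several references per prediction, hence the inner list."""
    return [[label.strip()] for label in decoded_labels]

labels = [[17, 42, 5, -100, -100], [8, 5, -100, -100, -100]]
unmasked = unmask_labels(labels)
references = wrap_references([" foo ", "bar"])
print(unmasked)
print(references)
```

After this replacement, `skip_special_tokens=True` drops the pad tokens during decoding, so the padding does not leak into the reference strings.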


### 6. Training Setup (SKIP IF ALREADY TRAINED)

In [22]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import torch

# Fine-tune on the first 30,000 training rows (a fixed slice, not a random sample)
train_sample_size = 30000
train_dataset = tokenized_datasets["train"].select(range(min(train_sample_size, len(tokenized_datasets["train"]))))

batch_size = 16
args = Seq2SeqTrainingArguments(
    "punjabi-translator-finetuned",
    evaluation_strategy="epoch",  # newer transformers releases name this `eval_strategy`
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is available
    push_to_hub=False,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
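
As a quick sanity check on the size of this run, the number of optimizer steps follows from the settings above. The sketch assumes a single device and the default gradient accumulation of 1, neither of which is stated in the notebook.

```python
import math

train_examples = 30_000   # train_sample_size above
per_device_batch = 16     # batch_size above
num_devices = 1           # assumption: single GPU or CPU
grad_accum_steps = 1      # assumption: Trainer default, not set above
epochs = 3                # num_train_epochs above

effective_batch = per_device_batch * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
```

Under these assumptions the run takes 1,875 steps per epoch and 5,625 steps in total, with one evaluation pass over the 2,000 validation rows at the end of each epoch.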



### 7. Fine-tuning

In [None]:
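# trainer.train()  # Uncomment to train
# After training, save to the folder that section 8 loads from, for example:
# trainer.save_model("./models/punjabi_translator")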

### 8. Testing the Saved Model
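Run this section to evaluate a model that has already been fine-tuned and saved to the `./models/punjabi_translator` folder. It relies on `raw_datasets`, `tokenized_datasets`, `target_token`, and `compute_metrics` from sections 2, 4, and 5.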

In [23]:
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
import torch
import pandas as pd

model_path = "./models/punjabi_translator"

print(f"🚀 Loading saved model from {model_path}...")
test_tokenizer = AutoTokenizer.from_pretrained(model_path)
test_model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

if torch.cuda.is_available():
    test_model = test_model.to("cuda")

# Build a collator here so this cell still runs when section 6 was skipped
test_data_collator = DataCollatorForSeq2Seq(test_tokenizer, model=test_model)

# Re-create a trainer for evaluation only
test_args = Seq2SeqTrainingArguments(
    "eval_output",
    predict_with_generate=True,
    per_device_eval_batch_size=16,
    fp16=torch.cuda.is_available(),
)

test_trainer = Seq2SeqTrainer(
    test_model,
    test_args,
    data_collator=test_data_collator,
    tokenizer=test_tokenizer,
    compute_metrics=compute_metrics,
)

print("📊 Running evaluation on full test set...")
test_metrics = test_trainer.evaluate(eval_dataset=tokenized_datasets["test"], metric_key_prefix="test")

print(f"\n🏆 Final Test BLEU Score: {test_metrics.get('test_bleu', 0):.2f}")
# The `chrf` metric is run with its defaults, so this is chrF rather than chrF++
print(f"🏆 Final Test chrF Score: {test_metrics.get('test_chrf', 0):.2f}")

# Sample translations
print("\n👀 Generating qualitative sample results...")
test_samples = raw_datasets["test"].select(range(5))
qualitative_results = []

for sample in test_samples:
    en_text = sample["translation"]["en"]
    pa_ref = sample["translation"]["pa"]
    input_text = target_token + en_text
    
    inputs = test_tokenizer(input_text, return_tensors="pt").to(test_model.device)
    with torch.no_grad():
        outputs = test_model.generate(**inputs, max_length=128, num_beams=5)
    pa_pred = test_tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    qualitative_results.append({
        "English Source": en_text,
        "Human Punjabi (Reference)": pa_ref,
        "Model Punjabi (Output)": pa_pred
    })

display(pd.DataFrame(qualitative_results))

🚀 Loading saved model from ./models/punjabi_translator...
📊 Running evaluation on full test set...

🏆 Final Test BLEU Score: 45.01
🏆 Final Test chrF Score: 70.28

👀 Generating qualitative sample results...


|   | English Source | Human Punjabi (Reference) | Model Punjabi (Output) |
|---|---|---|---|
| 0 | Published | ਪਬਲਿਸ਼ ਕੀਤੇ | ਪਬਲਿਸ਼ ਕੀਤੇ |
| 1 | Name: | ਨਾਂ: | ਨਾਂ: |
| 2 | Ignored | ਸਰੋਤ ਫਾਇਲ਼ਾਂ: | ਸਰੋਤ ਫਾਇਲ਼ਾਂ: |
| 3 | Thank you for using KDE | KDE ਵਰਤਣ ਲਈ ਧੰਨਵਾਦ | KDE ਵਰਤਣ ਲਈ ਧੰਨਵਾਦ |
| 4 | & Delete | ਹਟਾਓ( D) | ਹਟਾਓ( D) |


### 9. Interactive Manual Test
Type any English sentence below to see how the model translates it.

In [24]:
def translate_sentence(sentence):
    input_text = target_token + sentence
    inputs = test_tokenizer(input_text, return_tensors="pt").to(test_model.device)
    with torch.no_grad():
        outputs = test_model.generate(**inputs, max_length=128, num_beams=5)
    return test_tokenizer.decode(outputs[0], skip_special_tokens=True)

my_sentence = "Despite the overwhelming challenges posed by the rapidly changing climate and the increasing scarcity of natural resources, governments around the world are still struggling to implement effective policies that balance economic growth with environmental sustainability, which has led to debates on the need for more urgent and innovative solutions."
print(f"English: {my_sentence}")
print(f"Punjabi: {translate_sentence(my_sentence)}")
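
Both preprocessing and `generate` cap sequences at 128 tokens, so a long paragraph may be cut off or translated less reliably than short inputs. One possible workaround, sketched here under the assumption that splitting on sentence-final punctuation is good enough, is to translate sentence by sentence. `split_sentences` and `translate_long_text` are hypothetical helpers, and the lambda is a stand-in translator so the sketch runs without a model.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def translate_long_text(text, translate_fn):
    """Translate sentence by sentence and join the results with spaces."""
    return " ".join(translate_fn(s) for s in split_sentences(text))

# Stand-in translator for demonstration; in the notebook, pass
# `translate_sentence` instead of the lambda.
demo = translate_long_text("Revenue grew. Costs fell!", lambda s: s.upper())
print(demo)
```

A regex splitter like this mishandles abbreviations and decimal numbers, so a proper sentence segmenter would be a better fit for real report text.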

English: Despite the overwhelming challenges posed by the rapidly changing climate and the increasing scarcity of natural resources, governments around the world are still struggling to implement effective policies that balance economic growth with environmental sustainability, which has led to debates on the need for more urgent and innovative solutions.
Punjabi: ਮੌਸਮ ਤੇਜ਼ੀ ਨਾਲ ਬਦਲਦੇ ਮੌਸਮ ਅਤੇ ਕੁਦਰਤੀ ਵਾਤਾਵਰਣ ਦੀ ਵਧ ਰਹੀ ਸਮੱਸਿਆ ਦੇ ਬਾਵਜੂਦ, ਸੰਸਾਰ ਭਰ ਦੀਆਂ ਸਰਕਾਰਾਂ ਹਾਲੇ ਵੀ ਪ੍ਰਭਾਵਸ਼ਾਲੀ ਪਾਲਸੀਆਂ ਨੂੰ ਲਾਗੂ ਕਰਨ ਲਈ ਸੰਘਰਸ਼ ਕਰ ਰਹੀਆਂ ਹਨ, ਜੋ ਵਾਤਾਵਰਣ ਨੂੰ ਸਥਿਰ ਰੱਖਣ ਲਈ ਆਰਥਿਕ ਤਰੱਕੀ ਨੂੰ ਸੰਤੁਲਿਤ ਰੱਖਦੀਆਂ ਹਨ, ਜੋ ਕਿ ਹੋਰ ਜ਼ਰੂਰੀ ਅਤੇ