# In this Notebook
* We fine-tune the IndicTrans2-en-indic-dist-200M model on the mini-IITB English–Hindi parallel corpus for the EN→HI translation task, then evaluate the resulting model with a suite of MT quality metrics: BLEU, ChrF, BERTScore, BLEURT, and COMET.

In [10]:
# Check whether a GPU is available
!nvidia-smi


Fri Dec 12 09:17:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   41C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

In [11]:
# Install datasets, transformers (with SentencePiece support), and sacrebleu
!pip install datasets transformers[sentencepiece] sacrebleu -q

In [12]:
# Downgrade protobuf to resolve a version conflict
!pip install protobuf==3.20.3 



In [13]:
!pip uninstall -y pyarrow datasets
!pip cache purge

# install fresh, modern, compatible versions
!pip install --no-cache-dir "datasets>=2.21.0" "pyarrow>=15.0.0"
!pip uninstall -y numpy
!pip install numpy==1.26.4

Found existing installation: pyarrow 22.0.0
Uninstalling pyarrow-22.0.0:
  Successfully uninstalled pyarrow-22.0.0
Found existing installation: datasets 4.4.1
Uninstalling datasets-4.4.1:
  Successfully uninstalled datasets-4.4.1
Files removed: 6
Collecting datasets>=2.21.0
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0
  Downloading pyarrow-22.0.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
Downloading pyarrow-22.0.0-cp311-cp311-manylinux_2_28_x86_64.whl (47.7 MB)
Installing collected packages: pyarrow, datasets
ERROR: pip's dependency resolver does not currently take into account all the p

In [14]:
# Import all required modules
import os
import sys

import sacrebleu
import torch
import transformers
from datasets import load_dataset  # loads the parallel corpus
from torch.amp import autocast, GradScaler  # mixed-precision training
from torch.optim import AdamW  # optimizer
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import (
    AutoModelForSeq2SeqLM,  # encoder-decoder translation model
    AutoTokenizer,
    DataCollatorForSeq2Seq,  # pads each batch when it is collated
)



# IndicTrans2-en-indic-dist-200M Model
* Source: https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M

In [15]:
# Log in to the Hugging Face Hub: enter your access token when prompted, then rerun the cell
from huggingface_hub import login
login(new_session=False)

# Note
* I ran this on the free tier of Kaggle, where training the 1B-parameter model exhausted the available memory. Because of this constraint, I switched to the 200M-parameter model.

In [16]:
ckpt = "ai4bharat/indictrans2-en-indic-dist-200M" # Model Checkpoint 

model = AutoModelForSeq2SeqLM.from_pretrained(
    ckpt,
    trust_remote_code=True,                                         
)

# Cast the weights to fp16 and move the model to the GPU
# (the weights are cast back to fp32 before training)
model = model.to(torch.float16).to("cuda")

config.json:   0%|          | 0.00/1.37k [00:00<?, ?B/s]

configuration_indictrans.py:   0%|          | 0.00/14.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M:
- configuration_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_indictrans.py:   0%|          | 0.00/79.8k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M:
- modeling_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/1.10G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/163 [00:00<?, ?B/s]

# The Dataset

* Source: https://huggingface.co/datasets/atrisaxena/mini-iitb-english-hindi

In [17]:
raw_dataset = load_dataset("atrisaxena/mini-iitb-english-hindi")

README.md:   0%|          | 0.00/193 [00:00<?, ?B/s]

data/train.parquet:   0%|          | 0.00/2.87M [00:00<?, ?B/s]

data/validation.parquet:   0%|          | 0.00/84.7k [00:00<?, ?B/s]

data/test.parquet:   0%|          | 0.00/500k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/520 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2507 [00:00<?, ? examples/s]

In [18]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})

# Dataset Statistics

In [19]:
# Import required libraries
import numpy as np
import math
import nltk
nltk.download("punkt")  # one-time

def add_stats(example):
    """Add word, character, and sentence counts for the English side."""
    # Treat a missing sentence as empty and strip surrounding whitespace
    text = (example["translation"]["en"] or "").strip()
    words = text.split()
    # Approximate sentence count from NLTK's sentence tokenizer
    sents = nltk.tokenize.sent_tokenize(text) if text else []
    example["num_words"] = len(words)
    example["num_chars"] = len(text)
    example["num_sentences"] = len(sents)
    return example

raw_dataset = raw_dataset.map(add_stats, batched=False)

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/520 [00:00<?, ? examples/s]

Map:   0%|          | 0/2507 [00:00<?, ? examples/s]

In [20]:
# Summary statistics (min, percentiles, median, mean, std, max) for each split

def summary_stats(arr):
    arr = np.array(arr)
    return {
        "count": int(arr.size),
        "min": int(arr.min()) if arr.size>0 else None,
        "p1": int(np.percentile(arr, 1)) if arr.size>0 else None,
        "p10": int(np.percentile(arr, 10)) if arr.size>0 else None,
        "median": float(np.median(arr)) if arr.size>0 else None,
        "mean": float(arr.mean()) if arr.size>0 else None,
        "std": float(arr.std(ddof=0)) if arr.size>0 else None,
        "p90": int(np.percentile(arr, 90)) if arr.size>0 else None,
        "p99": int(np.percentile(arr, 99)) if arr.size>0 else None,
        "max": int(arr.max()) if arr.size>0 else None,
    }

for split in raw_dataset:
    d = raw_dataset[split]
    print(f"\n=== {split.upper()} ===")
    print("Words:", summary_stats(d["num_words"]))
    print("Chars:", summary_stats(d["num_chars"]))
    print("Sentences:", summary_stats(d["num_sentences"]))


=== TRAIN ===
Words: {'count': 20000, 'min': 0, 'p1': 1, 'p10': 1, 'median': 9.0, 'mean': 13.01335, 'std': 14.901938524148461, 'p90': 30, 'p99': 68, 'max': 335}
Chars: {'count': 20000, 'min': 0, 'p1': 4, 'p10': 8, 'median': 49.0, 'mean': 74.78915, 'std': 85.15344263315195, 'p90': 173, 'p99': 387, 'max': 1950}
Sentences: {'count': 20000, 'min': 0, 'p1': 1, 'p10': 1, 'median': 1.0, 'mean': 1.14165, 'std': 0.5593614908983278, 'p90': 1, 'p99': 4, 'max': 18}

=== VALIDATION ===
Words: {'count': 520, 'min': 3, 'p1': 5, 'p10': 8, 'median': 16.0, 'mean': 17.71153846153846, 'std': 8.598382089452237, 'p90': 29, 'p99': 42, 'max': 62}
Chars: {'count': 520, 'min': 24, 'p1': 28, 'p10': 47, 'median': 97.5, 'mean': 105.025, 'std': 52.31679894568356, 'p90': 170, 'p99': 266, 'max': 358}
Sentences: {'count': 520, 'min': 1, 'p1': 1, 'p10': 1, 'median': 1.0, 'mean': 1.0576923076923077, 'std': 0.2851087200499499, 'p90': 1, 'p99': 2, 'max': 4}

=== TEST ===
Words: {'count': 2507, 'min': 3, 'p1': 5, 'p10': 9

In [21]:
from datasets import DatasetDict

In [22]:
# The train split contains empty English sentences ('min': 0), so remove those rows
def not_empty(example):
    text = example["translation"]["en"]
    return text is not None and len(text.strip()) > 0
    
clean_train = raw_dataset["train"].filter(not_empty)
clean_val   = raw_dataset["validation"].filter(not_empty)
clean_test  = raw_dataset["test"].filter(not_empty)

raw_dataset = DatasetDict({
    "train": clean_train,
    "validation": clean_val,
    "test": clean_test
})

Filter:   0%|          | 0/20000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/520 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2507 [00:00<?, ? examples/s]

In [23]:
type(raw_dataset)

datasets.dataset_dict.DatasetDict

In [24]:
# Compute the 99th-percentile word count on the train split ('p99': 68) and drop longer sentences as outliers
word_lengths = np.array(raw_dataset["train"]["num_words"])
p99_threshold = int(np.percentile(word_lengths, 99))
print("Removing sentences longer than:", p99_threshold, "words")
raw_dataset["train"] = raw_dataset["train"].filter(
    lambda ex: ex["num_words"] <= p99_threshold
)


Removing sentences longer than: 68 words


Filter:   0%|          | 0/19995 [00:00<?, ? examples/s]

In [25]:
# Recompute the statistics after filtering, reusing summary_stats from the earlier cell

for split in raw_dataset:
    d = raw_dataset[split]
    print(f"\n=== {split.upper()} ===")
    print("Words:", summary_stats(d["num_words"]))
    print("Chars:", summary_stats(d["num_chars"]))
    print("Sentences:", summary_stats(d["num_sentences"]))



=== TRAIN ===
Words: {'count': 19799, 'min': 1, 'p1': 1, 'p10': 1, 'median': 9.0, 'mean': 12.200717207939794, 'std': 12.126657765760921, 'p90': 29, 'p99': 54, 'max': 68}
Chars: {'count': 19799, 'min': 1, 'p1': 4, 'p10': 8, 'median': 48.0, 'mean': 70.23930501540482, 'std': 69.67776015890949, 'p90': 168, 'p99': 309, 'max': 493}
Sentences: {'count': 19799, 'min': 1, 'p1': 1, 'p10': 1, 'median': 1.0, 'mean': 1.1209657053386535, 'std': 0.4557984939309774, 'p90': 1, 'p99': 3, 'max': 9}

=== VALIDATION ===
Words: {'count': 520, 'min': 3, 'p1': 5, 'p10': 8, 'median': 16.0, 'mean': 17.71153846153846, 'std': 8.598382089452237, 'p90': 29, 'p99': 42, 'max': 62}
Chars: {'count': 520, 'min': 24, 'p1': 28, 'p10': 47, 'median': 97.5, 'mean': 105.025, 'std': 52.31679894568356, 'p90': 170, 'p99': 266, 'max': 358}
Sentences: {'count': 520, 'min': 1, 'p1': 1, 'p10': 1, 'median': 1.0, 'mean': 1.0576923076923077, 'std': 0.2851087200499499, 'p90': 1, 'p99': 2, 'max': 4}

=== TEST ===
Words: {'count': 2507, 

In [26]:
# Inspect one training example
raw_dataset['train'][0]

{'translation': {'en': 'The occupation of keeping bees.',
  'hi': 'मधुमक्खियों को पालने का कार्य। '},
 'num_words': 5,
 'num_chars': 31,
 'num_sentences': 1}

In [27]:
from datasets import DatasetDict

# New desired sizes
N_TRAIN = 2000
N_VAL   = 150
N_TEST  = 250

# Downsample with .select(); this keeps the first N rows of each split (no shuffling)
small_train = raw_dataset["train"].select(range(N_TRAIN))
small_val   = raw_dataset["validation"].select(range(N_VAL))
small_test  = raw_dataset["test"].select(range(N_TEST))

# Create a new DatasetDict
small_dataset = DatasetDict({
    "train": small_train,
    "validation": small_val,
    "test": small_test
})

small_dataset

DatasetDict({
    train: Dataset({
        features: ['translation', 'num_words', 'num_chars', 'num_sentences'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['translation', 'num_words', 'num_chars', 'num_sentences'],
        num_rows: 150
    })
    test: Dataset({
        features: ['translation', 'num_words', 'num_chars', 'num_sentences'],
        num_rows: 250
    })
})

In [28]:
raw_dataset = small_dataset  # Task 2 asks for 2,000 sentence pairs

# Applying Tokenization: How We Obtain Embeddings
* attention_mask: the padding mask returned by the tokenizer, with 1 for real tokens and 0 for padding. The look-ahead (causal) mask is separate; the decoder applies it internally.
* input_ids: the tokenized text converted into integer indices from the tokenizer vocabulary.
* The model converts input_ids to embeddings internally through an embedding layer.

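# Pipeline
* Text → Tokens → IDs → Embeddings → Transformer
* Illustrative example (the real tokenizer splits text into subword pieces, so the tokens and IDs here are only indicative):
    * Text: "I love India"
    * Tokens (tokenization): ["I", "love", "India"]
    * input_ids (vocab lookup): [34, 91, 2563]
    * Embeddings: one vector per ID; these are the actual embeddings used by the model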
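
The pipeline above can be sketched in plain Python. This is a toy illustration under loud assumptions: the whitespace tokenizer, the vocabulary, `UNK_ID`, and the random embedding table are all hypothetical stand-ins, not the IndicTrans2 tokenizer or its learned weights. It pads on the right for readability, whereas the outputs in this notebook show the source side padded on the left.

```python
# Toy illustration only: a whitespace tokenizer and a made-up vocabulary.
# The real IndicTrans2 tokenizer uses subword pieces and its own ids.
import random

PAD_ID = 1  # padding id, matching the value visible in this notebook's outputs
UNK_ID = 3  # hypothetical id for out-of-vocabulary tokens
vocab = {"I": 34, "love": 91, "India": 2563, "translation": 57}  # hypothetical ids


def encode(text):
    """Text -> tokens -> ids (vocabulary lookup)."""
    return [vocab.get(token, UNK_ID) for token in text.split()]


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad to the longest sequence and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention_mask


def make_embedding_table(ids, dim=4, seed=0):
    """ids -> vectors; a random stand-in for the model's learned embedding layer."""
    rng = random.Random(seed)
    return {i: [rng.uniform(-1, 1) for _ in range(dim)] for i in ids}


batch = [encode("I love India"), encode("I love translation a lot")]
input_ids, attention_mask = pad_batch(batch)
table = make_embedding_table({PAD_ID, UNK_ID, *vocab.values()})
embedded = [[table[i] for i in row] for row in input_ids]

print(input_ids)       # ids per sentence, padded with PAD_ID
print(attention_mask)  # 1 for real tokens, 0 for padding
```

The attention mask only marks which positions are padding; the embedding lookup itself happens inside the model.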

In [29]:
tokenizer = AutoTokenizer.from_pretrained(ckpt)  # answer "y" at the custom-code prompt, or pass trust_remote_code=True to skip it

tokenizer_config.json:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

The repository ai4bharat/indictrans2-en-indic-dist-200M contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/ai4bharat/indictrans2-en-indic-dist-200M .
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


tokenization_indictrans.py:   0%|          | 0.00/8.04k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M:
- tokenization_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


dict.SRC.json:   0%|          | 0.00/645k [00:00<?, ?B/s]

dict.TGT.json:   0%|          | 0.00/3.39M [00:00<?, ?B/s]

model.SRC:   0%|          | 0.00/759k [00:00<?, ?B/s]

model.TGT:   0%|          | 0.00/3.26M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/96.0 [00:00<?, ?B/s]

In [30]:
# Example: tokenize a source sentence prefixed with the language tags
text = "Hello Myself Virendra. A final year student at NIT Surat."
tokenizer("eng_Latn hin_Deva " + text)

{'input_ids': [4, 15, 7951, 23463, 8660, 11258, 5933, 85, 55, 910, 195, 1410, 48, 349, 6601, 8308, 85, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [31]:
# Example: tokenize a target-language (Hindi, hin_Deva) sentence
with tokenizer.as_target_tokenizer():
    print(tokenizer("मधुमक्खियों को पालने का कार्य। "))

{'input_ids': [53995, 62658, 458, 23, 55780, 31, 353, 77606, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}




# Preprocessing Function for Tokenization

In [32]:
# tags for IndicTrans2 English->Hindi
SRC_TAG = "eng_Latn"
TGT_TAG = "hin_Deva"

max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "hi"

def preprocess_function(examples):
    """Tokenize a batch of translation pairs into model inputs and labels."""
    # Prefix every source sentence with the source and target language tags
    inputs = [f"{SRC_TAG} {TGT_TAG} {ex[source_lang].strip() if ex[source_lang] else ''}"
              for ex in examples["translation"]]
    targets = [ex[target_lang].strip() if ex[target_lang] else ""
               for ex in examples["translation"]]

    # Tokenize the tagged source sentences
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding=True,  # pads to the longest sequence in the batch passed to this call
    )

    # Tokenize the Hindi targets with the target-side vocabulary
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            [f"{TGT_TAG} {t}" if t else f"{TGT_TAG} " for t in targets],
            max_length=max_target_length,
            truncation=True,
            padding=True,  # label padding uses the pad id, not -100
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
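
Because `padding=True` is applied to the labels, padded label positions carry the pad id. Whether those positions contribute to the loss depends on the model's loss implementation, which this notebook does not show. A common convention is to replace label padding with -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss` and the default `label_pad_token_id` of `DataCollatorForSeq2Seq`. A minimal sketch of that masking step follows; the helper name `mask_label_padding` and the sample ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_rows, pad_id):
    """Replace padding ids in each label row with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_id else token for token in row]
        for row in label_rows
    ]


# Hypothetical label rows; 1 is the pad id visible in this notebook's outputs.
labels = [[53995, 62658, 2, 1, 1], [40041, 9, 1036, 10, 2]]
masked = mask_label_padding(labels, pad_id=1)
print(masked)
```

In the preprocessing function this could be applied as `mask_label_padding(labels["input_ids"], tokenizer.pad_token_id)` before assigning `model_inputs["labels"]`, assuming the tokenizer exposes `pad_token_id` as Hugging Face tokenizers normally do.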


In [33]:
# Tokenize the first two rows of the train split
print(preprocess_function(raw_dataset["train"][:2])) 


{'input_ids': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 15, 18, 7373, 8, 2889, 21033, 85, 2], [4, 15, 21940, 6, 343, 8, 18179, 3, 349, 16651, 784, 1339, 9, 181, 1401, 40, 6, 14911, 241, 6, 2224, 1750, 8083, 103, 885, 11, 8748, 85, 2]], 'attention_mask': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[105948, 59836, 2134, 5172, 43144, 53995, 62658, 458, 23, 55780, 31, 353, 77606, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [105948, 59836, 2134, 5172, 43144, 40041, 9, 1036, 10, 1094, 26, 693, 10630, 9, 10294, 56705, 1682, 5252, 2949, 105, 9, 173, 10294, 12640, 20, 1251, 101, 31, 1543, 1964, 78, 77606, 2]]}


In [34]:
# Hyperparameters
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

In [35]:
# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

Device: cuda


In [47]:
# 1) Tokenize the dataset and create the data collator
tokenized_datasets = raw_dataset.map(preprocess_function, batched=True,
                                    remove_columns=raw_dataset["train"].column_names)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="pt")

# 2) Training and validation dataloaders

train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=batch_size,
                              shuffle=True, collate_fn=data_collator, num_workers=4)
validation_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=batch_size,
                                   shuffle=False, collate_fn=data_collator, num_workers=2)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]



Map:   0%|          | 0/150 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

In [37]:
model.float()        # make sure parameters are float32
model.to(device)

# 3) Recreate optimizer (must be created after model param dtypes are finalized)
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

# 4) Mixed precision utilities
scaler = GradScaler()
grad_accum = 2
num_epochs = max(1, int(num_train_epochs))

allowed_keys = {"input_ids", "attention_mask", "labels", "decoder_input_ids", "decoder_attention_mask"}

In [38]:
# 5) Training loop
model.train()
for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")
    running_loss = 0.0
    steps = 0
    for batch in tqdm(train_dataloader, desc="Training"):
        # keep only model inputs and move to device
        model_batch = {k: v.to(device) for k, v in batch.items() if k in allowed_keys and isinstance(v, torch.Tensor)}

        with autocast("cuda"):                  # activations in fp16, params stay fp32
            outputs = model(**model_batch)
            loss = outputs.loss / grad_accum

        scaler.scale(loss).backward()

        if (steps + 1) % grad_accum == 0:
            # Unscale the gradients so that clipping sees their true magnitudes
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        running_loss += loss.item() * grad_accum
        steps += 1

    avg_train = running_loss / max(1, steps)
    print(f"Train loss: {avg_train:.4f}")

    # 6) Quick validation
    model.eval()
    vloss = 0.0
    vsteps = 0
    with torch.no_grad():
        for vb in tqdm(validation_dataloader, desc="Validation"):
            vb_batch = {k: v.to(device) for k, v in vb.items() if k in allowed_keys and isinstance(v, torch.Tensor)}
            with autocast("cuda"):
                out = model(**vb_batch)
            vloss += out.loss.item()
            vsteps += 1
    vloss = vloss / max(1, vsteps)
    print(f"Validation loss: {vloss:.4f}")
    model.train()

print("Training finished.")



Epoch 1/1


Training:   0%|          | 0/125 [00:00<?, ?it/s]

Train loss: 9.3129


Validation:   0%|          | 0/10 [00:00<?, ?it/s]

Validation loss: 6.6893
Training finished.
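
With 125 training batches and `grad_accum = 2`, the condition `(steps + 1) % grad_accum == 0` fires 62 times, so the gradients of the final batch are accumulated but never applied. The sketch below only simulates the step schedule, with and without a flush on the last batch; the helper `optimizer_step_indices` is hypothetical and is not part of the training loop above.

```python
def optimizer_step_indices(num_batches, grad_accum, flush_last=True):
    """Return the 0-based batch indices after which the optimizer would step."""
    indices = []
    for step in range(num_batches):
        window_full = (step + 1) % grad_accum == 0
        is_last = step == num_batches - 1
        if window_full or (flush_last and is_last):
            indices.append(step)
    return indices


without_flush = optimizer_step_indices(125, 2, flush_last=False)
with_flush = optimizer_step_indices(125, 2, flush_last=True)
print(len(without_flush), len(with_flush))
```

In the training loop, the equivalent change would be to step when `(steps + 1) % grad_accum == 0 or (steps + 1) == len(train_dataloader)`. Since each loss is divided by `grad_accum`, an update from a partial window is proportionally smaller.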


In [39]:
# Simple inference (the model was moved to the GPU earlier)
SRC_TAG = "eng_Latn"
TGT_TAG = "hin_Deva"
text = "My name is Virendra, I am Currently a AI/ML Researcher"
tagged = f"{SRC_TAG} {TGT_TAG} {text}"

inputs = tokenizer(tagged, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))

मेरा नाम विरेन्द्र है मैं वर्तमान में एक ए. आई./एम. एल. शोधकर्ता हूँ


# Evaluation metrics

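# Finding BLEU and ChrF:

1. BLEU: measures n-gram precision, that is, how many word n-grams from the candidate sentence also appear in the reference sentence, combined with a brevity penalty for candidates that are too short.
2. ChrF: computes an F-score over character n-grams instead of word n-grams, which makes it more tolerant of morphological variation.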
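
The two ideas can be sketched on a toy English pair. This is a simplified illustration, not sacrebleu's implementation: real BLEU takes a geometric mean of precisions for n = 1 to 4 and applies a brevity penalty and its own tokenization, and real ChrF averages over several character n-gram orders. The function names are hypothetical.

```python
from collections import Counter


def ngrams(items, n):
    """All contiguous n-grams of a token list or a string."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def clipped_precision(candidate, reference, n):
    """BLEU-style precision: candidate n-gram counts are clipped by reference counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def char_fscore(candidate, reference, n=3, beta=2.0):
    """ChrF-style F-beta score over character n-grams of a single order n."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum((cand & ref).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


cand_tokens = "the cat sat on the mat".split()
ref_tokens = "the cat is on the mat".split()
p1 = clipped_precision(cand_tokens, ref_tokens, 1)
p2 = clipped_precision(cand_tokens, ref_tokens, 2)
chrf3 = char_fscore("the cat sat on the mat", "the cat is on the mat")
print(p1, p2, chrf3)
```

Five of the six candidate unigrams and three of the five candidate bigrams occur in the reference, which is why the bigram precision is lower than the unigram precision.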

In [40]:
model.device

device(type='cuda', index=0)

In [41]:
n_samples = 10

# determine model device safely
try:
    model_device = next(model.parameters()).device
except StopIteration:
    model_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Model device:", model_device)

preds = []
refs = []

for i in range(min(n_samples, len(raw_dataset["test"]))):
    row = raw_dataset["test"][i]

    # handle multiple possible row formats
    if isinstance(row, dict) and "translation" in row:
        trans = row["translation"]
        # trans might be a dict or a string; handle both
        if isinstance(trans, dict):
            eng = trans.get("en") or trans.get("eng") or ""
            hi_ref = trans.get("hi") or trans.get("hin") or ""
        else:
            # sometimes translation is a string (rare) — treat as source
            eng = str(trans)
            hi_ref = ""
    elif isinstance(row, dict):
        # maybe keys are directly 'en' and 'hi'
        eng = row.get("en") or row.get("eng") or row.get("source") or ""
        hi_ref = row.get("hi") or row.get("hin") or row.get("target") or ""
    else:
        # fallback: row itself might be the translation dict-like
        try:
            eng = row["en"]
            hi_ref = row["hi"]
        except Exception:
            # last resort: stringify
            eng = str(row)
            hi_ref = ""

    eng = (eng or "").strip()
    hi_ref = (hi_ref or "").strip()
    refs.append(hi_ref if hi_ref else "")  # keep alignment

    # add required tags
    tagged = f"{SRC_TAG} {TGT_TAG} {eng}"

    # tokenize -> torch tensors -> move to model device
    tokenized = tokenizer(tagged, return_tensors="pt", padding=True, truncation=True, max_length=128)
    tokenized = {k: v.to(model_device) for k, v in tokenized.items()}

    # generate
    with torch.no_grad():
        out_ids = model.generate(**tokenized, max_length=128, num_beams=4, early_stopping=True)

    pred_text = tokenizer.decode(out_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True).strip()
    preds.append(pred_text)


Model device: cuda:0


In [42]:
#BLEU:
bleu = sacrebleu.corpus_bleu(preds, [refs])
#CHRF:
chrf = sacrebleu.corpus_chrf(preds, [refs])

print(f"\nEvaluated {len(preds)} samples")
print("BLEU:", bleu.score)
print("CHRF:", chrf.score)

# show a few examples
for i in range(min(5, len(preds))):
    print(f"\n=== SAMPLE {i+1} ===")
    print("SRC :", raw_dataset["test"][i]["translation"]["en"])
    print("PRED:", preds[i])
    print("REF :", refs[i])



Evaluated 10 samples
BLEU: 19.117911694708003
CHRF: 46.351201054120544

=== SAMPLE 1 ===
SRC : A black box in your car?
PRED: आपकी गाड़ी में एक ब्लैक बॉक्स
REF : आपकी कार में ब्लैक बॉक्स?

=== SAMPLE 2 ===
SRC : As America's road planners struggle to find the cash to mend a crumbling highway system, many are beginning to see a solution in a little black box that fits neatly by the dashboard of your car.
PRED: जैसे - जैसे अमेरिका के सड़क योजनाकारों को ढहती हुई राजमार्ग प्रणाली को ठीक करने के लिए नकदी खोजने में परेशानी हो रही है और कई लोगों को एक छोटे से ब्लैक बॉक्स में एक समाधान दिखाई देने लगा है जो आपकी कार के डैशबोर्ड पर अच्छी तरह से फिट बैठता है ।
REF : जबकि अमेरिका के सड़क योजनाकार, ध्वस्त होते हुए हाईवे सिस्टम को सुधारने के लिए धन की कमी से जूझ रहे हैं, वहीं बहुत-से लोग इसका समाधान छोटे से ब्लैक बॉक्स में देख रहे हैं, जो आपकी कार के डैशबोर्ड पर सफ़ाई से फिट हो जाता है।

=== SAMPLE 3 ===
SRC : The devices, which track every mile a motorist drives and transmit that information to bu

# Finding BERTScore:

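* BERTScore: embeds the candidate and reference tokens with a pretrained BERT-family model (BERT, RoBERTa, or mBERT), matches each token to its most similar counterpart on the other side by cosine similarity, and reports precision, recall, and F1 over those matches.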
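
The greedy matching step can be sketched with hand-written vectors. This is an assumption-laden toy: the two-dimensional vectors stand in for contextual embeddings, and the real `bert_score` package adds details such as optional IDF weighting and baseline rescaling. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from greedy best-match cosine similarities."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy "embeddings": two candidate tokens, three reference tokens
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
precision, recall, f1 = greedy_bertscore(cand, ref)
print(precision, recall, f1)
```

Every candidate vector has an exact match in the reference, so precision is 1.0, while the third reference vector is only partly covered, which pulls recall below 1.0.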

In [43]:
!pip install bert-score
from bert_score import score

P, R, F1 = score(preds, refs, lang="hi")
print("BERTScore F1:", F1.mean().item())

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.0.

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]

BERTScore F1: 0.847461998462677


# Finding BLEURT:
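* BLEURT: a BERT model fine-tuned on human ratings that predicts a quality score for each candidate–reference pair. The log below shows the default `bleurt-base-128` checkpoint being loaded; that checkpoint was trained on English ratings, so its scores on Hindi output should be read with caution.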

In [44]:
# Required Packages for Bleurt
!pip install evaluate
!pip install git+https://github.com/google-research/bleurt.git

# Calculating Bleurt
import evaluate
bleurt = evaluate.load("bleurt")
results = bleurt.compute(predictions=preds, references=refs)
print(results)

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-xnqvdi_b
  Running command git clone --filter=blob:none --quiet https://github.com/google-research/bleurt.git /tmp/pip-req-build-xnqvdi_b
  Resolved https://github.com/google-research/bleurt.git to commit cebe7e6f996b40910cfaa520a63db47807e3bf5c
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: BLEURT
  Building wheel for BLEURT (setup.py) ... done
  Created wheel for BLEURT: filename=BLEURT-0.0.2-py3-none-any.whl size=16456766 sha256=3efc3b719a0c8943c287439257901

Downloading builder script: 0.00B [00:00, ?B/s]



Downloading data:   0%|          | 0.00/405M [00:00<?, ?B/s]

INFO:tensorflow:Reading checkpoint /root/.cache/huggingface/metrics/bleurt/default/downloads/extracted/887f2dc36c17f53c287f696681b8f7c947278407c1cf9f226662e16c8c0dc417/bleurt-base-128.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
I0000 00:00:1765531228.650015      47 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2704 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
I0000 00:00:1765531228.650782      47 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13942 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
INFO:tensorflow:BLEURT initialized.


{'scores': [0.47538262605667114, -0.05329755321145058, 0.06402337551116943, 0.08614060282707214, -0.1480509340763092, 0.4468385875225067, 0.14272350072860718, 0.09977954626083374, 0.04155030474066734, 0.14204412698745728]}
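
BLEURT returns one score per segment, and a corpus-level figure is usually reported as their mean. A minimal sketch using the ten scores printed above (the variable names are hypothetical):

```python
from statistics import mean

# Segment-level BLEURT scores copied from the output above
bleurt_scores = [
    0.47538262605667114, -0.05329755321145058, 0.06402337551116943,
    0.08614060282707214, -0.1480509340763092, 0.4468385875225067,
    0.14272350072860718, 0.09977954626083374, 0.04155030474066734,
    0.14204412698745728,
]

corpus_bleurt = mean(bleurt_scores)
print(f"Mean BLEURT over {len(bleurt_scores)} segments: {corpus_bleurt:.4f}")
```

With only ten segments the mean is sensitive to individual sentences, so it is best read next to the per-segment values.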


# Finding COMET
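* COMET: a learned metric whose neural model takes the source sentence, the machine translation, and the reference, and predicts a quality score trained to correlate strongly with human judgment.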

In [45]:
# Required package for Comet
!pip install -q unbabel-comet
from comet import download_model, load_from_checkpoint


# choose model variable 
translation_model = globals().get("translation_model", None) or globals().get("model", None)
if translation_model is None:
    raise ValueError("No translation model found. Load your model into `model` or `translation_model` first.")

# device: try to get model device (handles DeviceMap too)
try:
    model_device = next(translation_model.parameters()).device
except StopIteration:
    model_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

SRC_TAG = "eng_Latn"
TGT_TAG = "hin_Deva"

srcs = []
preds = []
refs = []

n = 10
for i in range(min(n, len(raw_dataset["test"]))):
    row = raw_dataset["test"][i]

    # robust extraction of source & reference
    if isinstance(row, dict) and "translation" in row:
        trans = row["translation"]
        if isinstance(trans, dict):
            src = (trans.get("en") or trans.get("eng") or "").strip()
            ref = (trans.get("hi") or trans.get("hin") or "").strip()
        else:
            src = str(trans).strip()
            ref = ""
    elif isinstance(row, dict):
        src = (row.get("en") or row.get("eng") or row.get("source") or "").strip()
        ref = (row.get("hi") or row.get("hin") or row.get("target") or "").strip()
    else:
        # fallback
        src = str(row).strip()
        ref = ""

    srcs.append(src)
    refs.append(ref)

    # add language tags required by IndicTrans2
    tagged = f"{SRC_TAG} {TGT_TAG} {src}"

    # tokenize -> PyTorch tensors -> move to model device
    tokenized = tokenizer(tagged,
                          return_tensors="pt",
                          truncation=True,
                          padding=True,
                          max_length=128)
    tokenized = {k: v.to(model_device) for k, v in tokenized.items()}

    # generate
    with torch.no_grad():
        out_ids = translation_model.generate(**tokenized, max_length=128, num_beams=4, early_stopping=True)

    pred = tokenizer.decode(out_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True).strip()
    preds.append(pred)

# ---- COMET evaluation ----
model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

# prepare data list for COMET
data = [{"src": s, "mt": p, "ref": r} for s, p, r in zip(srcs, preds, refs)]
# note: comet_model.predict returns a dict-like object; its 'scores' entry holds one score per segment
comet_out = comet_model.predict(data, batch_size=8)
comet_scores = comet_out["scores"] if isinstance(comet_out, dict) and "scores" in comet_out else comet_out

print("Samples evaluated:", len(preds))
print("Mean COMET score:", float(sum(comet_scores) / len(comet_scores)))

# quick side-by-side preview
for i in range(len(preds)):
    print(f"\n--- SAMPLE {i+1} ---")
    print("SRC :", srcs[i])
    print("PRED:", preds[i])
    print("REF :", refs[i])


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
opentelemetry-proto 1.37.0 requires protobuf<7.0,>=5.0, but you have protobuf 4.25.8 which is incompatible.
a2a-sdk 0.

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../root/.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/2760a223ac957f30acfb18c8aa649b01cf1d75f2/checkpoints/model.ckpt`


/usr/local/lib/python3.11/dist-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0: 100%|██████████| 2/2 [00:00<00:00,  4.72it/s]


Samples evaluated: 10
Mean COMET score: 0.7581199586391449

--- SAMPLE 1 ---
SRC : A black box in your car?
PRED: आपकी कार में एक काला बॉक्स
REF : आपकी कार में ब्लैक बॉक्स?

--- SAMPLE 2 ---
SRC : As America's road planners struggle to find the cash to mend a crumbling highway system, many are beginning to see a solution in a little black box that fits neatly by the dashboard of your car.
PRED: जैसे ही अमेरिका के सड़क योजनाकार एक जर्जर राजमार्ग प्रणाली को ठीक करने के लिए नकदी खोजने के लिए संघर्ष कर रहे हैं, कई लोग एक छोटे से ब्लैक बॉक्स में एक समाधान देखने लगे हैं जो आपकी कार के डैशबोर्ड के साथ अच्छी तरह से फिट बैठता है ।
REF : जबकि अमेरिका के सड़क योजनाकार, ध्वस्त होते हुए हाईवे सिस्टम को सुधारने के लिए धन की कमी से जूझ रहे हैं, वहीं बहुत-से लोग इसका समाधान छोटे से ब्लैक बॉक्स में देख रहे हैं, जो आपकी कार के डैशबोर्ड पर सफ़ाई से फिट हो जाता है।

--- SAMPLE 3 ---
SRC : The devices, which track every mile a motorist drives and transmit that information to bureaucrats, are at the center 
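
The COMET cell passes every row to the scorer, including any row whose reference came back empty from the fallback extraction. One possible guard, sketched here, drops incomplete triples before scoring. `build_comet_inputs` is a hypothetical helper name; the `src`/`mt`/`ref` keys match the dictionaries built in that cell, and the sample strings are made-up placeholders rather than notebook data.

```python
def build_comet_inputs(srcs, preds, refs):
    """Build COMET input dicts, skipping triples that have an empty field."""
    data, skipped = [], []
    for idx, (src, pred, ref) in enumerate(zip(srcs, preds, refs)):
        if src.strip() and pred.strip() and ref.strip():
            data.append({"src": src, "mt": pred, "ref": ref})
        else:
            # Remember which rows were dropped so they can be inspected later.
            skipped.append(idx)
    return data, skipped


# Toy input: the second row has no reference and should be skipped.
data, skipped = build_comet_inputs(
    ["Hello", "Good morning"],
    ["नमस्ते", "सुप्रभात"],
    ["नमस्ते", ""],
)
print(len(data), "kept; skipped indices:", skipped)
```

The returned `data` list could then be passed to `comet_model.predict` in place of the unfiltered list.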

In [46]:
# Save the fine-tuned model and tokenizer to local directories
model.save_pretrained("pt_model")
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/dict.SRC.json',
 'tokenizer/dict.TGT.json',
 'tokenizer/model.SRC',
 'tokenizer/model.TGT',
 'tokenizer/added_tokens.json')
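
Before reusing the saved artifacts, it can help to confirm that the files listed in the output above are actually on disk. The following is a minimal sketch under that assumption; `missing_files` and `EXPECTED_TOKENIZER_FILES` are hypothetical names introduced here, and the file list is copied from the `save_pretrained` output above.

```python
from pathlib import Path

# File names taken from the tokenizer.save_pretrained output above.
EXPECTED_TOKENIZER_FILES = [
    "tokenizer_config.json",
    "special_tokens_map.json",
    "dict.SRC.json",
    "dict.TGT.json",
    "model.SRC",
    "model.TGT",
    "added_tokens.json",
]


def missing_files(directory, expected=EXPECTED_TOKENIZER_FILES):
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


# An empty list means every expected tokenizer file was written.
print("Missing tokenizer files:", missing_files("tokenizer"))
```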