In [6]:
!pip install sacrebleu datasets transformers SentencePiece sacremoses


BLEU measures n-gram overlap between:

The machine-generated translation (called hypothesis), and

The reference translation(s) (human-annotated ground truth).

It checks how many words, word pairs, triplets, etc. match between them.

Reference:  The cat is on the mat
Hypothesis: The cat sat on the mat

Unigrams (n=1):

Overlapping words: “The”, “cat”, “on”, “the”, “mat” → 5 matches out of the 6 words in the hypothesis (unigram precision 5/6)

Bigrams (n=2):

Reference: "The cat", "cat is", "is on", "on the", "the mat"

Hypothesis: "The cat", "cat sat", "sat on", "on the", "the mat"

Overlaps: "The cat", "on the", "the mat" → 3 matches out of the 5 bigrams in the hypothesis (bigram precision 3/5)

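Apply brevity penalty (BP)

BLEU combines the n-gram precisions (usually n = 1 to 4) as a geometric mean and multiplies the result by BP. If the hypothesis is shorter than the reference, BP = exp(1 - r/c), where r is the reference length and c is the hypothesis length; otherwise BP = 1. Here both sentences have 6 words, so BP = 1.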
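The counts in the example above can be reproduced with a few lines of plain Python. This is a minimal sketch for illustration only: the helper names (`ngrams`, `clipped_precision`, `brevity_penalty`) are made up here, tokenization is a simple whitespace split, and it is not how sacrebleu is implemented internally.

```python
from collections import Counter
from math import exp

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyp_tokens, ref_tokens, n):
    """Return (matches, total) for n-grams, clipping each match by its reference count."""
    hyp_counts = Counter(ngrams(hyp_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matches, total

def brevity_penalty(hyp_len, ref_len):
    """BP = 1 if the hypothesis is at least as long as the reference, else exp(1 - r/c)."""
    return 1.0 if hyp_len >= ref_len else exp(1 - ref_len / hyp_len)

ref = "The cat is on the mat".split()
hyp = "The cat sat on the mat".split()

p1 = clipped_precision(hyp, ref, 1)       # unigram matches / total
p2 = clipped_precision(hyp, ref, 2)       # bigram matches / total
bp = brevity_penalty(len(hyp), len(ref))  # equal lengths, so no penalty

print(f"unigram precision: {p1[0]}/{p1[1]}")
print(f"bigram precision:  {p2[0]}/{p2[1]}")
print(f"brevity penalty:   {bp:.3f}")
print(f"BP for a 4-word hypothesis: {brevity_penalty(4, len(ref)):.3f}")
```

Note that the comparison is case-sensitive in this sketch, which is why "The" and "the" count as separate matches.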


In [7]:
import os
import re
import random
import torch
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer
import sacrebleu
from tqdm import tqdm


# Config

MODEL_ID = "Helsinki-NLP/opus-mt-en-hi"
DATASET_ID = "cfilt/iitb-english-hindi"
SPLIT = "test"

# Full test evaluation
EVAL_SIZE = None
BATCH_SIZE = 64
MAX_NEW_TOKENS = 128
NUM_BEAMS = 6
LENGTH_PENALTY = 1.1     # Exponent applied to sequence length when normalizing beam scores; larger values favor longer outputs.
NO_REPEAT_NGRAM = 2      # Prevents any 2-token sequence (bigram) from appearing twice in one generated output.
SEED = 42

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
USE_FP16 = DEVICE == "cuda"   # half-precision on GPU
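
To make the effect of LENGTH_PENALTY concrete: as far as I understand the transformers beam scorer, a finished beam is ranked by its summed log-probability divided by its length raised to `length_penalty`. The sketch below applies that formula to two hypothetical candidates; the log-probabilities and lengths are invented for illustration and `beam_score` and `preferred` are not library functions.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: sum of log-probs / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)

# Hypothetical finished beams (values invented for this example).
short = {"sum_logprobs": -4.0, "length": 4}
long_ = {"sum_logprobs": -5.05, "length": 5}

def preferred(length_penalty):
    """Return which of the two candidates gets the higher normalized score."""
    s = beam_score(short["sum_logprobs"], short["length"], length_penalty)
    l = beam_score(long_["sum_logprobs"], long_["length"], length_penalty)
    return "long" if l > s else "short"

for lp in (1.0, 1.1):
    print(f"length_penalty={lp}: prefers the {preferred(lp)} candidate")
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score of a long sequence closer to zero, so raising the value from 1.0 to 1.1 tilts the ranking toward longer outputs.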

In [8]:

# Reproducibility & backend

random.seed(SEED)
torch.manual_seed(SEED)
if DEVICE == "cuda":
    # Lets cuDNN auto-tune kernels for speed; this can reduce run-to-run determinism.
    torch.backends.cudnn.benchmark = True


# Load model & tokenizer

print(f"Loading model: {MODEL_ID} on {DEVICE} (fp16={USE_FP16})")
tokenizer = MarianTokenizer.from_pretrained(MODEL_ID)
model = MarianMTModel.from_pretrained(MODEL_ID)
if USE_FP16:
    model = model.half()
model = model.to(DEVICE).eval()

Loading model: Helsinki-NLP/opus-mt-en-hi on cuda (fp16=True)



In [9]:

# Load dataset

print(f"Loading dataset: {DATASET_ID} [{SPLIT}]")
ds = load_dataset(DATASET_ID, split=SPLIT)

def extract_pair(example):
    tr = example["translation"]
    return {"src": tr["en"].strip(), "ref": tr["hi"].strip()}

ds = ds.map(extract_pair, remove_columns=ds.column_names)

# Show first 5 raw examples
print("\n=== First 5 examples from IITB English-Hindi (test) ===")
for i in range(min(5, len(ds))):
    print(f"\nExample {i+1}:")
    print("SRC:", ds[i]["src"])
    print("REF:", ds[i]["ref"])

# Safety check (we want full test)
if EVAL_SIZE is not None and EVAL_SIZE < len(ds):
    raise ValueError("EVAL_SIZE must be None to evaluate the entire test dataset.")

srcs = ds["src"]
refs = ds["ref"]

Loading dataset: cfilt/iitb-english-hindi [test]




=== First 5 examples from IITB English-Hindi (test) ===

Example 1:
SRC: A black box in your car?
REF: आपकी कार में ब्लैक बॉक्स?

Example 2:
SRC: As America's road planners struggle to find the cash to mend a crumbling highway system, many are beginning to see a solution in a little black box that fits neatly by the dashboard of your car.
REF: जबकि अमेरिका के सड़क योजनाकार, ध्वस्त होते हुए हाईवे सिस्टम को सुधारने के लिए धन की कमी से जूझ रहे हैं, वहीं बहुत-से लोग इसका समाधान छोटे से ब्लैक बॉक्स में देख रहे हैं, जो आपकी कार के डैशबोर्ड पर सफ़ाई से फिट हो जाता है।

Example 3:
SRC: The devices, which track every mile a motorist drives and transmit that information to bureaucrats, are at the center of a controversial attempt in Washington and state planning offices to overhaul the outdated system for funding America's major roads.
REF: यह डिवाइस, जो मोटर-चालक द्वारा वाहन चलाए गए प्रत्येक मील को ट्रैक करती है तथा उस सूचना को अधिकारियों को संचारित करती है, आजकल अमेरिका की प्रमुख सड़कों का वि

In [14]:

# Translation (beam search)

@torch.inference_mode()
def translate_batch(texts):
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    for k in enc:
        enc[k] = enc[k].to(DEVICE)
    gen = model.generate(
        **enc,
        max_new_tokens=MAX_NEW_TOKENS,
        num_beams=NUM_BEAMS,
        length_penalty=LENGTH_PENALTY,
        no_repeat_ngram_size=NO_REPEAT_NGRAM,
        use_cache=True,
    )
    out = tokenizer.batch_decode(gen, skip_special_tokens=True)
    return [o.strip() for o in out]

print("\nTranslating… (full test set)")
hyps = []
for i in tqdm(range(0, len(srcs), BATCH_SIZE)):
    batch_src = srcs[i:i+BATCH_SIZE]
    hyps.extend(translate_batch(batch_src))

assert len(hyps) == len(refs)


# Light normalization (Hindi)
# Standardizes spacing before sentence-final punctuation and collapses repeated whitespace,
# so that trivial formatting differences do not unfairly affect metrics such as BLEU.

def normalize_hi(s: str) -> str:
    s = s.strip()
    # unify danda/full stop spacing and collapse spaces
    s = s.replace(" ।", "।").replace(" .", ".")
    s = re.sub(r"\s+", " ", s)
    return s

hyps = [normalize_hi(x) for x in hyps]
refs = [normalize_hi(x) for x in refs]



Translating… (full test set)


100%|██████████| 40/40 [04:37<00:00,  6.94s/it]
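
The `no_repeat_ngram_size` argument passed to `model.generate` above can be pictured as a filter on the next token: any token that would complete an n-gram already present in the output is blocked. The function below is a simplified sketch of that idea, not the library's implementation; `banned_next_tokens` is a hypothetical name, and it works on words here, whereas the real model operates on subword token IDs.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size <= 0 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        gram = tuple(generated[i:i + ngram_size])
        if gram[:-1] == prefix:
            banned.add(gram[-1])
    return banned

tokens = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(tokens, 2))  # "the cat" already occurred, so "cat" is blocked
print(banned_next_tokens(tokens, 3))  # no trigram starts with "on the" yet
```

With a size of 2 the constraint is fairly strict: a legitimate repeated word pair within one sentence is also blocked, which is one reason this setting can change BLEU in either direction.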


In [18]:
my_english_sentences = [
    "Hello, how are you?",
    "This is an example sentence.",
    "Machine translation is interesting.",
    "Ruksad loves to learn machine learning"
]

hindi_translations = translate_batch(my_english_sentences)

for original, translated in zip(my_english_sentences, hindi_translations):
    print(f"English: {original}")
    print(f"Hindi: {translated}\n")

English: Hello, how are you?
Hindi: हैलो, तुम कैसे हो?

English: This is an example sentence.
Hindi: यह एक उदाहरण वाक्य है ।

English: Machine translation is interesting.
Hindi: मशीन अनुवाद दिलचस्प है.

English: Ruksad loves to learn machine learning
Hindi: रुडड मशीन सीखने के लिए प्यार करता है
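
The sample translations above end in three different ways: a danda preceded by a space, a Latin full stop, and no punctuation at all. The sketch below runs a copy of `normalize_hi` (as defined in the evaluation cell) on such strings to show what it does and does not unify. The test strings are hand-written for illustration.

```python
import re

def normalize_hi(s: str) -> str:
    # Copy of the notebook's function, repeated so this example is self-contained.
    s = s.strip()
    s = s.replace(" ।", "।").replace(" .", ".")
    s = re.sub(r"\s+", " ", s)
    return s

spaced_danda = normalize_hi("यह एक उदाहरण वाक्य है ।")
tight_danda = normalize_hi("यह एक उदाहरण वाक्य है।")
latin_stop = normalize_hi("मशीन अनुवाद दिलचस्प है.")
hindi_stop = normalize_hi("मशीन अनुवाद दिलचस्प है।")
double_space = normalize_hi("वाक्य है  ।")  # two spaces before the danda

print(spaced_danda == tight_danda)  # spacing before the danda is unified
print(latin_stop == hindi_stop)     # "." and "।" stay distinct
print(repr(double_space))           # one space survives before the danda
```

Two things follow from this. A hypothesis ending in "." will still miss a reference ending in "।" on that token, and because the danda replacement runs before whitespace is collapsed, a double space before the danda is only reduced to a single space. Collapsing whitespace first would avoid the second case, at the cost of slightly changing the scores reported in this notebook.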



In [19]:





# Corpus BLEU (13a and intl)

print("\nComputing corpus BLEU (sacrebleu)…")
bleu_13a  = sacrebleu.corpus_bleu(hyps, [refs], tokenize="13a")
bleu_intl = sacrebleu.corpus_bleu(hyps, [refs], tokenize="intl")

print("\n=== Corpus BLEU (EN→HI, IITB full test) ===")
print(f"BLEU 13a : {bleu_13a.score:.2f}")
print(f"BLEU intl: {bleu_intl.score:.2f}  (recommended for Indic)")


# Sentence-level ranking (BLEU intl)

print("\nScoring sentences (sentence-level BLEU, intl)…")
sent_bleu_scores = [
    sacrebleu.sentence_bleu(h, [r], tokenize="intl").score
    for h, r in zip(hyps, refs)
]

def topk_indices(scores, k, reverse=True):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)[:k]

K = 5
best_idx = topk_indices(sent_bleu_scores, K, reverse=True)
worst_idx = topk_indices(sent_bleu_scores, K, reverse=False)

def pretty_show(indices, title):
    print(f"\n=== {title} (by sentence BLEU, intl) ===")
    for rank, i in enumerate(indices, 1):
        print(f"\n[{rank}] BLEU={sent_bleu_scores[i]:.2f}")
        print(f"SRC: {srcs[i]}")
        print(f"REF: {refs[i]}")
        print(f"HYP: {hyps[i]}")

pretty_show(best_idx, "Top 5 correct examples")
pretty_show(worst_idx, "Bottom 5 bad-performing examples")


# Save outputs (optional)

out_dir = "eval_outputs_en_hi_iitb_full_bleu"
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "sources.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(srcs))
with open(os.path.join(out_dir, "references.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(refs))
with open(os.path.join(out_dir, "hypotheses.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(hyps))

print(f"\nSaved to: {out_dir}/ (sources.txt, references.txt, hypotheses.txt)")
print("\nDone.")


Computing corpus BLEU (sacrebleu)…

=== Corpus BLEU (EN→HI, IITB full test) ===
BLEU 13a : 9.79
BLEU intl: 10.27  (recommended for Indic)

Scoring sentences (sentence-level BLEU, intl)…

=== Top 5 correct examples (by sentence BLEU, intl) ===

[1] BLEU=78.25
SRC: This project is a key element of energy security of the whole European continent.
REF: यह परियोजना पूरे यूरोपीय महाद्वीप की ऊर्जा सुरक्षा का एक मुख्य तत्व है।
HYP: यह परियोजना पूरे यूरोपीय महाद्वीप की ऊर्जा सुरक्षा का एक प्रमुख तत्व है।

[2] BLEU=75.98
SRC: What was the cause of death?
REF: मौत का कारण क्या था?
HYP: मृत्यु का कारण क्या था?

[3] BLEU=74.48
SRC: Someone in the German parliament says we should build a German Google.
REF: जर्मन संसद में किसी ने कहा की हमे एक जर्मन गूगल का निर्माण करना चाहिए।
HYP: जर्मन संसद में किसी ने कहा कि हम एक जर्मन गूगल का निर्माण करना चाहिए।

[4] BLEU=70.17
SRC: Frontier has gone the furthest in this area, though.
REF: Frontier हालांकि, इस क्षेत्र में आगे चला गया है.
HYP: हालांकि, इस क्षेत
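
A note on the saved files: writing with `"\n".join(...)` keeps sources, references, and hypotheses line-aligned only if no segment contains a line break of its own. Whether the IITB test set contains such segments is not checked in this notebook, so the following is a defensive sketch under that assumption; `write_lines` and `count_lines` are hypothetical helpers and the segments are made-up examples.

```python
import os
import tempfile

def write_lines(path, segments):
    """Write one segment per line, replacing embedded line breaks with spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for seg in segments:
            f.write(" ".join(seg.splitlines()) + "\n")

def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

segments = ["first segment", "second\nsegment with a line break", "third segment"]

with tempfile.TemporaryDirectory() as tmp:
    naive_path = os.path.join(tmp, "naive.txt")
    with open(naive_path, "w", encoding="utf-8") as f:
        f.write("\n".join(segments))  # same pattern as the save step above
    safe_path = os.path.join(tmp, "safe.txt")
    write_lines(safe_path, segments)
    naive_count = count_lines(naive_path)
    safe_count = count_lines(safe_path)

print(f"segments: {len(segments)}, naive lines: {naive_count}, safe lines: {safe_count}")
```

Comparing the line count of each saved file against `len(hyps)` is a cheap check before feeding the files to any external scoring tool.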

In [20]:
"""
Fast IITB EN→HI evaluation with Helsinki-NLP/opus-mt-en-hi
- Prints first 5 dataset examples
- FULL test evaluation (no subsetting)
- BLEU via sacrebleu (corpus + sentence-level)
- 5 best / 5 worst examples by sentence BLEU
"""

import os
import random
import torch
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer
import sacrebleu
from tqdm import tqdm


# Config (tune these for speed/quality)

MODEL_ID = "Helsinki-NLP/opus-mt-en-hi"
DATASET_ID = "cfilt/iitb-english-hindi"
SPLIT = "test"

# Evaluate FULL test set
EVAL_SIZE = None          # <-- full test; do not change unless you want a subset
BATCH_SIZE = 64           # adjust for your hardware
MAX_NEW_TOKENS = 96
NUM_BEAMS = 1             # 1 = greedy (keep same behavior)
NO_REPEAT_NGRAM = 0
SEED = 42

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
USE_FP16 = DEVICE == "cuda"  # enable half-precision on GPU


# Reproducibility & backend

random.seed(SEED)
torch.manual_seed(SEED)
if DEVICE == "cuda":
    # Lets cuDNN auto-tune kernels for speed; this can reduce run-to-run determinism.
    torch.backends.cudnn.benchmark = True


# Load model & tokenizer

print(f"Loading model: {MODEL_ID} on {DEVICE} (fp16={USE_FP16})")
tokenizer = MarianTokenizer.from_pretrained(MODEL_ID)
model = MarianMTModel.from_pretrained(MODEL_ID)
if USE_FP16:
    model = model.half()
model = model.to(DEVICE).eval()


# Load dataset

print(f"Loading dataset: {DATASET_ID} [{SPLIT}]")
ds = load_dataset(DATASET_ID, split=SPLIT)

def extract_pair(example):
    tr = example["translation"]
    return {"src": tr["en"].strip(), "ref": tr["hi"].strip()}

ds = ds.map(extract_pair, remove_columns=ds.column_names)

# Show first 5 raw examples
print("\n=== First 5 examples from IITB English-Hindi (test) ===")
for i in range(5):
    print(f"\nExample {i+1}:")
    print("SRC:", ds[i]["src"])
    print("REF:", ds[i]["ref"])

# (No subsetting; evaluate full test set)
if EVAL_SIZE is not None and EVAL_SIZE < len(ds):
    raise ValueError("EVAL_SIZE is not None; set it to None to evaluate the entire test dataset.")

srcs = ds["src"]
refs = ds["ref"]

# Batched translation

@torch.inference_mode()
def translate_batch(texts):
    enc = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True
    )
    # Move to device
    for k in enc:
        enc[k] = enc[k].to(DEVICE)
    gen = model.generate(
        **enc,
        max_new_tokens=MAX_NEW_TOKENS,
        num_beams=NUM_BEAMS,
        length_penalty=1.0,  # only used by beam search; no effect when NUM_BEAMS = 1
        no_repeat_ngram_size=NO_REPEAT_NGRAM,
        use_cache=True,
    )
    out = tokenizer.batch_decode(gen, skip_special_tokens=True)
    return [o.strip() for o in out]

print("\nTranslating… (full test set)")
hyps = []
for i in tqdm(range(0, len(srcs), BATCH_SIZE)):
    batch_src = srcs[i:i+BATCH_SIZE]
    hyps.extend(translate_batch(batch_src))

assert len(hyps) == len(refs)


# Corpus BLEU only

print("\nComputing corpus BLEU (sacrebleu)…")
bleu = sacrebleu.corpus_bleu(hyps, [refs])

print("\n=== Corpus BLEU (EN→HI, IITB full test) ===")
print(f"BLEU: {bleu.score:.2f}")


# Sentence-level ranking (BLEU)

print("\nScoring sentences (sentence-level BLEU)…")
# sacrebleu.sentence_bleu returns an object with .score
sent_bleu_scores = [sacrebleu.sentence_bleu(h, [r]).score for h, r in zip(hyps, refs)]

def topk_indices(scores, k, reverse=True):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)[:k]

K = 5
best_idx = topk_indices(sent_bleu_scores, K, reverse=True)
worst_idx = topk_indices(sent_bleu_scores, K, reverse=False)

def pretty_show(indices, title):
    print(f"\n=== {title} (by sentence BLEU) ===")
    for rank, i in enumerate(indices, 1):
        print(f"\n[{rank}] BLEU={sent_bleu_scores[i]:.2f}")
        print(f"SRC: {srcs[i]}")
        print(f"REF: {refs[i]}")
        print(f"HYP: {hyps[i]}")

pretty_show(best_idx, "Top 5 correct examples")
pretty_show(worst_idx, "Bottom 5 bad-performing examples")


# Save outputs (optional)

out_dir = "eval_outputs_en_hi_iitb_full_bleu"
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "sources.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(srcs))
with open(os.path.join(out_dir, "references.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(refs))
with open(os.path.join(out_dir, "hypotheses.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(hyps))

print(f"\nSaved to: {out_dir}/ (sources.txt, references.txt, hypotheses.txt)")
print("\nDone.")


Loading model: Helsinki-NLP/opus-mt-en-hi on cuda (fp16=True)
Loading dataset: cfilt/iitb-english-hindi [test]




=== First 5 examples from IITB English-Hindi (test) ===

Example 1:
SRC: A black box in your car?
REF: आपकी कार में ब्लैक बॉक्स?

Example 2:
SRC: As America's road planners struggle to find the cash to mend a crumbling highway system, many are beginning to see a solution in a little black box that fits neatly by the dashboard of your car.
REF: जबकि अमेरिका के सड़क योजनाकार, ध्वस्त होते हुए हाईवे सिस्टम को सुधारने के लिए धन की कमी से जूझ रहे हैं, वहीं बहुत-से लोग इसका समाधान छोटे से ब्लैक बॉक्स में देख रहे हैं, जो आपकी कार के डैशबोर्ड पर सफ़ाई से फिट हो जाता है।

Example 3:
SRC: The devices, which track every mile a motorist drives and transmit that information to bureaucrats, are at the center of a controversial attempt in Washington and state planning offices to overhaul the outdated system for funding America's major roads.
REF: यह डिवाइस, जो मोटर-चालक द्वारा वाहन चलाए गए प्रत्येक मील को ट्रैक करती है तथा उस सूचना को अधिकारियों को संचारित करती है, आजकल अमेरिका की प्रमुख सड़कों का वि

100%|██████████| 40/40 [00:28<00:00,  1.40it/s]



Computing corpus BLEU (sacrebleu)…

=== Corpus BLEU (EN→HI, IITB full test) ===
BLEU: 8.96

Scoring sentences (sentence-level BLEU)…

=== Top 5 correct examples (by sentence BLEU) ===

[1] BLEU=100.00
SRC: What was the cause of death?
REF: मौत का कारण क्या था?
HYP: मौत का कारण क्या था?

[2] BLEU=69.31
SRC: This project is a key element of energy security of the whole European continent.
REF: यह परियोजना पूरे यूरोपीय महाद्वीप की ऊर्जा सुरक्षा का एक मुख्य तत्व है।
HYP: यह परियोजना पूरे यूरोपीय महाद्वीप की ऊर्जा सुरक्षा का एक प्रमुख तत्व है ।

[3] BLEU=67.32
SRC: What about the first?
REF: पहले वाले के बारे में क्या?
HYP: पहले के बारे में क्या?

[4] BLEU=63.89
SRC: And I think about my father.
REF: और मैं अपने पिता के बारे में सोचता हूँ।
HYP: और मैं अपने पिता के बारे में सोचते हैं.

[5] BLEU=60.04
SRC: Share with us your thoughts in the comments below.
REF: अपने विचार नीचे टिप्पणियों में हमारे साथ साझा करें।
HYP: नीचे टिप्पणियों में हमारे साथ साझा करें.

=== Bottom 5 bad-performing examp