# In this Notebook:- 
* We fine-tuned the IndicTrans2--indic-en-dist-200M model on the PHINC (Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation) for the HINGLISH -> EN translation task. The resulting model was evaluated using a comprehensive suite of MT quality metrics, including BLEU, ChrF, COMET, BERTScore, and BLEURT.* 

[](http://)

In [11]:
!pip uninstall -y numpy scipy pandas pyarrow datasets transformers
!pip install --force-reinstall --no-cache-dir \
  numpy==1.26.4 \
  scipy==1.11.4 \
  pandas==2.1.4 \
  pyarrow==14.0.2 \
  datasets==2.16.1 \
  transformers==4.36.2


Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
Found existing installation: scipy 1.11.4
Uninstalling scipy-1.11.4:
  Successfully uninstalled scipy-1.11.4
Found existing installation: pandas 2.1.4
Uninstalling pandas-2.1.4:
  Successfully uninstalled pandas-2.1.4
Found existing installation: pyarrow 14.0.2
Uninstalling pyarrow-14.0.2:
  Successfully uninstalled pyarrow-14.0.2
Found existing installation: datasets 2.16.1
Uninstalling datasets-2.16.1:
  Successfully uninstalled datasets-2.16.1
Found existing installation: transformers 4.36.2
Uninstalling transformers-4.36.2:
  Successfully uninstalled transformers-4.36.2
Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting scipy==1.11.4
  Downloading scipy-1.11.4-cp31

In [12]:
#Checking wheather GPU is working or not
!nvidia-smi


Thu Dec 18 16:31:03 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

In [13]:
# installing dataset and transformer
!pip install datasets transformers[sentencepiece] sacrebleu -q

In [14]:
# to remove version conflict of Protobuf so, downgrade version of Protobuf
!pip install protobuf==3.20.3 



In [15]:
# Importing all required modules
import os
import sys
import transformers
import torch  # pytorch Import
import sacrebleu
from torch.amp import autocast, GradScaler
from tqdm.auto import tqdm
from transformers import DataCollatorForSeq2Seq
from torch.utils.data import DataLoader
from torch.optim import AdamW
from datasets import load_dataset # for loading the dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM # For getting Embedding
from transformers import DataCollatorForSeq2Seq #getting sequential model and collator for loading batchwise of data
from torch.optim import AdamW # Optimizer



# Indictrans2-en-indic-dist-200M Model
* source: https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M

In [16]:
# Enter Access Token and rerun
from huggingface_hub import login
login(new_session=False)

# Note:
* I was using the free version of Kaggle, and the memory limit was getting exhausted while training the 1B-parameter model. Because of this constraint, I switched to using the 200M-parameter model instead.

In [17]:
ckpt = "ai4bharat/indictrans2-indic-en-dist-200M" # Model Checkpoint 

model = AutoModelForSeq2SeqLM.from_pretrained(
    ckpt,
    trust_remote_code=True,                                         
)

tokenizer = AutoTokenizer.from_pretrained(
    ckpt,
    trust_remote_code=True
)

# Move safely to GPU
model = model.to(torch.float16).to("cuda")   



config.json:   0%|          | 0.00/1.37k [00:00<?, ?B/s]

configuration_indictrans.py:   0%|          | 0.00/14.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-indic-en-dist-200M:
- configuration_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_indictrans.py:   0%|          | 0.00/79.8k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-indic-en-dist-200M:
- modeling_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


model.safetensors:   0%|          | 0.00/913M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/163 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.10k [00:00<?, ?B/s]

tokenization_indictrans.py:   0%|          | 0.00/8.04k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-indic-en-dist-200M:
- tokenization_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


dict.SRC.json:   0%|          | 0.00/3.39M [00:00<?, ?B/s]

dict.TGT.json:   0%|          | 0.00/645k [00:00<?, ?B/s]

model.SRC:   0%|          | 0.00/3.26M [00:00<?, ?B/s]

model.TGT:   0%|          | 0.00/759k [00:00<?, ?B/s]

# The Dataset¶

* Source: https://huggingface.co/datasets/LingoIITGN/PHINC

In [18]:
raw_dataset = load_dataset("LingoIITGN/PHINC")

Downloading readme: 0.00B [00:00, ?B/s]

Downloading data:   0%|          | 0.00/2.13M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [19]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'English_Translation'],
        num_rows: 13738
    })
})

In [20]:
def convert_to_translation(example):
    return {
        "translation": {
            "en": example["English_Translation"],
            "hing": example["Sentence"]
        }
    }

raw_dataset = raw_dataset.map(convert_to_translation)

Map:   0%|          | 0/13738 [00:00<?, ? examples/s]

In [21]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'English_Translation', 'translation'],
        num_rows: 13738
    })
})

In [22]:
from datasets import DatasetDict

# total rows
total_rows = raw_dataset["train"].num_rows

# required splits
train_rows = 12000
valid_rows = 538
test_rows = 1200

# slice the dataset
train_dataset = raw_dataset["train"].select(range(0, train_rows))
valid_dataset = raw_dataset["train"].select(range(train_rows, train_rows + valid_rows))
test_dataset  = raw_dataset["train"].select(range(train_rows + valid_rows,
                                                  train_rows + valid_rows + test_rows))

# create final DatasetDict
final_dataset = DatasetDict({
    "train": train_dataset,
    "validation": valid_dataset,
    "test": test_dataset
})

final_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'English_Translation', 'translation'],
        num_rows: 12000
    })
    validation: Dataset({
        features: ['Sentence', 'English_Translation', 'translation'],
        num_rows: 538
    })
    test: Dataset({
        features: ['Sentence', 'English_Translation', 'translation'],
        num_rows: 1200
    })
})

In [23]:
raw_dataset = final_dataset

# Observation for Statistics related to dataset

In [24]:
# Import required libraries
import numpy as np
import math
import nltk
nltk.download("punkt")  # one-time

def add_stats(example):
    text = example["translation"]["en"]
    # guard
    if text is None: text = ""
    text = text.strip() # Removes unwanted spacing
    words = text.split()
    # sentence count (approx)
    sents = nltk.tokenize.sent_tokenize(text) if text else []
    example["num_words"] = len(words)
    example["num_chars"] = len(text)
    example["num_sentences"] = len(sents)
    return example

raw_dataset = raw_dataset.map(add_stats, batched=False)

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Map:   0%|          | 0/12000 [00:00<?, ? examples/s]

Map:   0%|          | 0/538 [00:00<?, ? examples/s]

Map:   0%|          | 0/1200 [00:00<?, ? examples/s]

In [25]:
# Obtaining the Statistics:

def summary_stats(arr):
    arr = np.array(arr)
    return {
        "count": int(arr.size),
        "min": int(arr.min()) if arr.size>0 else None,
        "p1": int(np.percentile(arr, 1)) if arr.size>0 else None,
        "p10": int(np.percentile(arr, 10)) if arr.size>0 else None,
        "median": float(np.median(arr)) if arr.size>0 else None,
        "mean": float(arr.mean()) if arr.size>0 else None,
        "std": float(arr.std(ddof=0)) if arr.size>0 else None,
        "p90": int(np.percentile(arr, 90)) if arr.size>0 else None,
        "p99": int(np.percentile(arr, 99)) if arr.size>0 else None,
        "max": int(arr.max()) if arr.size>0 else None,
    }

for split in raw_dataset:
    d = raw_dataset[split]
    print(f"\n=== {split.upper()} ===")
    print("Words:", summary_stats(d["num_words"]))
    print("Chars:", summary_stats(d["num_chars"]))
    print("Sentences:", summary_stats(d["num_sentences"]))


=== TRAIN ===
Words: {'count': 12000, 'min': 1, 'p1': 1, 'p10': 5, 'median': 11.0, 'mean': 12.335166666666666, 'std': 6.736442926764507, 'p90': 22, 'p99': 31, 'max': 46}
Chars: {'count': 12000, 'min': 1, 'p1': 1, 'p10': 30, 'median': 67.0, 'mean': 74.00516666666667, 'std': 38.514235116887484, 'p90': 130, 'p99': 167, 'max': 278}
Sentences: {'count': 12000, 'min': 1, 'p1': 1, 'p10': 1, 'median': 1.0, 'mean': 1.5585, 'std': 0.989062392706682, 'p90': 3, 'p99': 5, 'max': 16}

=== VALIDATION ===
Words: {'count': 538, 'min': 1, 'p1': 1, 'p10': 5, 'median': 11.0, 'mean': 11.962825278810408, 'std': 6.470587802969751, 'p90': 21, 'p99': 28, 'max': 36}
Chars: {'count': 538, 'min': 1, 'p1': 1, 'p10': 28, 'median': 69.0, 'mean': 73.38289962825279, 'std': 39.0196100416231, 'p90': 125, 'p99': 169, 'max': 221}
Sentences: {'count': 538, 'min': 1, 'p1': 1, 'p10': 1, 'median': 1.0, 'mean': 1.3717472118959109, 'std': 0.694709906589641, 'p90': 2, 'p99': 4, 'max': 6}

=== TEST ===
Words: {'count': 1200, 'mi

In [26]:
from datasets import DatasetDict

In [27]:
# Train has min length of sentences as 0 ('min': 0) so, we Remove these row from dataset
def not_empty(example):
    text = example["translation"]["en"]
    return text is not None and len(text.strip()) > 0
    
clean_train = raw_dataset["train"].filter(not_empty)
clean_val   = raw_dataset["validation"].filter(not_empty)
clean_test  = raw_dataset["test"].filter(not_empty)

raw_dataset = DatasetDict({
    "train": clean_train,
    "validation": clean_val,
    "test": clean_test
})

Filter:   0%|          | 0/12000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/538 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1200 [00:00<?, ? examples/s]

In [28]:
type(raw_dataset)

datasets.dataset_dict.DatasetDict

In [29]:
# compute p99 threshold('p99': 31 so, removing other outliers(longer than 31 words))
word_lengths = np.array(raw_dataset["train"]["num_words"])
p99_threshold = int(np.percentile(word_lengths, 99))
print("Removing sentences longer than:", p99_threshold, "words")
raw_dataset["train"] = raw_dataset["train"].filter(
    lambda ex: ex["num_words"] <= p99_threshold
)


Removing sentences longer than: 31 words


Filter:   0%|          | 0/12000 [00:00<?, ? examples/s]

In [30]:
# Obtaining the desired Statistics
def summary_stats(arr):
    arr = np.array(arr)
    return {
        "count": int(arr.size),
        "min": int(arr.min()) if arr.size>0 else None,
        "p1": int(np.percentile(arr, 1)) if arr.size>0 else None,
        "p10": int(np.percentile(arr, 10)) if arr.size>0 else None,
        "median": float(np.median(arr)) if arr.size>0 else None,
        "mean": float(arr.mean()) if arr.size>0 else None,
        "std": float(arr.std(ddof=0)) if arr.size>0 else None,
        "p90": int(np.percentile(arr, 90)) if arr.size>0 else None,
        "p99": int(np.percentile(arr, 99)) if arr.size>0 else None,
        "max": int(arr.max()) if arr.size>0 else None,
    }

for split in raw_dataset:
    d = raw_dataset[split]
    print(f"\n=== {split.upper()} ===")
    print("Words:", summary_stats(d["num_words"]))
    print("Chars:", summary_stats(d["num_chars"]))
    print("Sentences:", summary_stats(d["num_sentences"]))



=== TRAIN ===
Words: {'count': 11912, 'min': 1, 'p1': 1, 'p10': 5, 'median': 11.0, 'mean': 12.170752182672935, 'std': 6.478840709255029, 'p90': 22, 'p99': 29, 'max': 31}
Chars: {'count': 11912, 'min': 1, 'p1': 1, 'p10': 30, 'median': 67.0, 'mean': 73.26544660846206, 'std': 37.62196715365125, 'p90': 129, 'p99': 161, 'max': 278}
Sentences: {'count': 11912, 'min': 1, 'p1': 1, 'p10': 1, 'median': 1.0, 'mean': 1.542981867024849, 'std': 0.9439458074386832, 'p90': 3, 'p99': 5, 'max': 12}

=== VALIDATION ===
Words: {'count': 538, 'min': 1, 'p1': 1, 'p10': 5, 'median': 11.0, 'mean': 11.962825278810408, 'std': 6.470587802969751, 'p90': 21, 'p99': 28, 'max': 36}
Chars: {'count': 538, 'min': 1, 'p1': 1, 'p10': 28, 'median': 69.0, 'mean': 73.38289962825279, 'std': 39.0196100416231, 'p90': 125, 'p99': 169, 'max': 221}
Sentences: {'count': 538, 'min': 1, 'p1': 1, 'p10': 1, 'median': 1.0, 'mean': 1.3717472118959109, 'std': 0.694709906589641, 'p90': 2, 'p99': 4, 'max': 6}

=== TEST ===
Words: {'count'

In [31]:
# Sample Example
raw_dataset['train'][0]

{'Sentence': "@someUSER congratulations on you celebrating british kid singers sophia grace's and rosie's 1st anniversary of a visit of your show .  how",
 'English_Translation': "@some users congratulate you for celebrating British kid singers Sophia Grace's and Rosie's 1st anniversary visit of your show",
 'translation': {'en': "@some users congratulate you for celebrating British kid singers Sophia Grace's and Rosie's 1st anniversary visit of your show",
  'hing': "@someUSER congratulations on you celebrating british kid singers sophia grace's and rosie's 1st anniversary of a visit of your show .  how"},
 'num_words': 19,
 'num_chars': 126,
 'num_sentences': 1}

In [32]:
from datasets import DatasetDict

# New desired sizes
N_TRAIN = 2000
N_VAL   = 150
N_TEST  = 250

# Downsample using .select()
small_train = raw_dataset["train"].select(range(N_TRAIN))
small_val   = raw_dataset["validation"].select(range(N_VAL))
small_test  = raw_dataset["test"].select(range(N_TEST))

# Create a new DatasetDict
small_dataset = DatasetDict({
    "train": small_train,
    "validation": small_val,
    "test": small_test
})

small_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'English_Translation', 'translation', 'num_words', 'num_chars', 'num_sentences'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['Sentence', 'English_Translation', 'translation', 'num_words', 'num_chars', 'num_sentences'],
        num_rows: 150
    })
    test: Dataset({
        features: ['Sentence', 'English_Translation', 'translation', 'num_words', 'num_chars', 'num_sentences'],
        num_rows: 250
    })
})

In [33]:
model.eval() # Evaluation of model


IndicTransForConditionalGeneration(
  (model): IndicTransModel(
    (encoder): IndicTransEncoder(
      (embed_tokens): Embedding(122706, 512, padding_idx=1)
      (embed_positions): IndicTransSinusoidalPositionalEmbedding()
      (layers): ModuleList(
        (0-17): 18 x IndicTransEncoderLayer(
          (self_attn): IndicTransAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise

In [34]:
# Hinglish --> Hindi Translation

SRC_TAG = "hin_Deva"   # Hinglish / Roman Hindi
TGT_TAG = "eng_Latn"  # Hindi
def hinglish_to_hindi(sentences, batch_size=8, max_len=128):
    outputs = []

    device = model.device

    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]

        tagged = [
            f"{SRC_TAG} {TGT_TAG} {text}"
            for text in batch
        ]

        inputs = tokenizer(
            tagged,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_len
        ).to(device)

        with torch.no_grad():
            generated = model.generate(
                **inputs,
                max_length=max_len,
                num_beams=4,
                use_cache=False,       
                early_stopping=True
            )

        decoded = tokenizer.batch_decode(
            generated,
            skip_special_tokens=True
        )

        outputs.extend(decoded)

    return outputs


In [35]:
def convert_split(split):
    hinglish_sentences = [ex["hing"] for ex in split["translation"]]
    hindi_sentences = hinglish_to_hindi(hinglish_sentences)

    new_translation = []
    for i in range(len(split)):
        new_translation.append({
            "en": split["translation"][i]["en"],
            "hing": split["translation"][i]["hing"],
            "hi": hindi_sentences[i]
        })

    return split.remove_columns("translation").add_column(
        "translation", new_translation
    )


In [36]:
small_dataset["train"] = convert_split(small_dataset["train"])
small_dataset["validation"] = convert_split(small_dataset["validation"])
small_dataset["test"] = convert_split(small_dataset["test"])


Flattening the indices:   0%|          | 0/2000 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/150 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/250 [00:00<?, ? examples/s]

In [37]:
small_dataset["train"]["translation"][0]

{'en': "@some users congratulate you for celebrating British kid singers Sophia Grace's and Rosie's 1st anniversary visit of your show",
 'hi': "@ soomeUSER congratulations on you celebrating British kid singers Sophia Grace's and Rosie's 1st anniversary of a visit of your show.",
 'hing': "@someUSER congratulations on you celebrating british kid singers sophia grace's and rosie's 1st anniversary of a visit of your show .  how"}

In [38]:
small_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'English_Translation', 'num_words', 'num_chars', 'num_sentences', 'translation'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['Sentence', 'English_Translation', 'num_words', 'num_chars', 'num_sentences', 'translation'],
        num_rows: 150
    })
    test: Dataset({
        features: ['Sentence', 'English_Translation', 'num_words', 'num_chars', 'num_sentences', 'translation'],
        num_rows: 250
    })
})

In [39]:
raw_dataset = small_dataset  # As said in Task 2(for 2000 pair of sentence)

# Applying Tokenization:- How we obtain Embedding
* Attention Mask = Padding Mask x look Ahead Mask
* Input_ids = input_ids are tokenized text converted into numeric indices from tokenizer vocabulary.
* The model converts input_ids to embeddings internally through an embedding layer.

# Pipeline
* Text → Tokens → IDs → Embeddings → Transformer
* "I love India"
*      ↓              (tokenization)
* ["I","love","India"]
*      ↓              (vocab lookup)
* [34, 91, 2563]  ← input_ids
*      ↓
* [embedding vectors] ← actual embeddings used by model

In [40]:
# sample Example;
text = "Gud mrng sir aapko Mahashivratri ki hardik mangalkamnaye"
tokenizer("eng_Latn hin_Deva " + text)

{'input_ids': [4, 8, 2533, 36571, 5529, 7905, 50011, 79091, 4608, 25513, 54765, 79701, 61466, 8038, 51588, 14800, 60903, 59642, 32540, 85148, 6497, 81571, 7155, 15962, 5194, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [41]:
# Tokenize the Target language(English) - Example
with tokenizer.as_target_tokenizer():
    print(tokenizer("Hello Myself Virendra. A final year student at NIT Surat."))

{'input_ids': [7926, 23435, 8635, 11233, 5909, 71, 45, 893, 179, 1391, 38, 332, 6577, 8283, 71, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}




 # Preprocess Fxn for Tokenization:

In [42]:
# tags for IndicTrans2 Hindi->English
SRC_TAG = "hin_Deva"
TGT_TAG = "eng_Latn "

max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "hi"

def preprocess_function(examples):

    inputs = [f"{SRC_TAG} {TGT_TAG} {ex[source_lang].strip() if ex[source_lang] else ''}"
              for ex in examples["translation"]]
    targets = [ex[target_lang].strip() if ex[target_lang] else "" 
               for ex in examples["translation"]]

    # tokenize source (each string already prefixed with tags)
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding=True,   
    )

    # Tokenizer for Target lang(Hinglish)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            [f"{TGT_TAG} {t}" if t else f"{TGT_TAG} " for t in targets],
            max_length=max_target_length,
            truncation=True,
            padding=True,
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


In [43]:
# Obtaining token of two rows of Train dataset
print(preprocess_function(raw_dataset["train"][:2])) 


{'input_ids': [[1, 1, 1, 8, 4, 14091, 1888, 48792, 66680, 99473, 66685, 18750, 13422, 90365, 37330, 69312, 85119, 65070, 9831, 2041, 12953, 73041, 70577, 29893, 66491, 1888, 6879, 60431, 23890, 66491, 1888, 203, 12505, 92843, 71624, 4123, 43705, 68463, 2], [8, 4, 14091, 3253, 17604, 29316, 6344, 59864, 87875, 44363, 67492, 4608, 73190, 21216, 13422, 41386, 94148, 8484, 45185, 65354, 6344, 60349, 78079, 46148, 91691, 4465, 3881, 7732, 2546, 19742, 2570, 53703, 4451, 4523, 5759, 12377, 9966, 19761, 2]], 'attention_mask': [[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[7580, 25232, 817, 516, 443, 5888, 84, 3446, 1304, 5195, 892, 6426, 6383, 25, 22, 6046, 1179, 4339, 14029, 21059, 18875, 4614, 17, 9, 10165, 1548, 4614, 17, 188, 794, 3074, 7, 13, 726, 7, 53, 637, 71, 2, 1, 1, 1, 1, 1, 1, 1,

In [44]:
# Setting  hyperparameter Values
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

In [45]:
# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

Device: cuda


In [46]:
# 1) Tokenization of dataset and DataCollator
tokenized_datasets = raw_dataset.map(preprocess_function, batched=True,
                                    remove_columns=raw_dataset["train"].column_names)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="pt")

# 2)train_dataloader and test_dataloader

train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=batch_size,
                              shuffle=True, collate_fn=data_collator, num_workers=4)
validation_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=batch_size,
                                   shuffle=False, collate_fn=data_collator, num_workers=2)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

In [47]:
model.float()        # make sure parameters are float32
model.to(device)

# 3) Recreate optimizer (must be created after model param dtypes are finalized)
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

# 4) Mixed precision utilities
scaler = GradScaler()
grad_accum = 2
num_epochs = max(1, int(num_train_epochs))

allowed_keys = {"input_ids", "attention_mask", "labels", "decoder_input_ids", "decoder_attention_mask"}

In [48]:
# 5) Training loop
model.train()
for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")
    running_loss = 0.0
    steps = 0
    for batch in tqdm(train_dataloader, desc="Training"):
        # keep only model inputs and move to device
        model_batch = {k: v.to(device) for k, v in batch.items() if k in allowed_keys and isinstance(v, torch.Tensor)}

        with autocast("cuda"):                  # activations in fp16, params stay fp32
            outputs = model(**model_batch)
            loss = outputs.loss / grad_accum

        scaler.scale(loss).backward()

        if (steps + 1) % grad_accum == 0:
            # Unscale the Optimizer
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        running_loss += loss.item() * grad_accum
        steps += 1

    avg_train = running_loss / max(1, steps)
    print(f"Train loss: {avg_train:.4f}")

    # 6) Validation
    model.eval()
    vloss = 0.0
    vsteps = 0
    with torch.no_grad():
        for vb in tqdm(validation_dataloader, desc="Validation"):
            vb_batch = {k: v.to(device) for k, v in vb.items() if k in allowed_keys and isinstance(v, torch.Tensor)}
            with autocast("cuda"):
                out = model(**vb_batch)
            vloss += out.loss.item()
            vsteps += 1
    vloss = vloss / max(1, vsteps)
    print(f"Validation loss: {vloss:.4f}")
    model.train()

print("Training finished.")



Epoch 1/1


Training:   0%|          | 0/125 [00:00<?, ?it/s]

Train loss: 7.4972


Validation:   0%|          | 0/10 [00:00<?, ?it/s]

Validation loss: 6.2216
Training finished.


In [61]:
import re
URL_PATTERN = r"https?://\S+"
HANDLE_PATTERN = r"@\w+"
TECH_WORDS = [
    "ai/ml",
    "ai",
    "ml",
    "artificial intelligence",
    "machine learning",
    "data science",
    "deep learning"
]
SOCIAL_WORDS = [
    "really",
    "amazing",
    "awesome",
    "emotional",
    "touching",
    "bhai",
    "and",
    "sir",
    "madam",
    "fan",
    "fans",
    "love",
    "respect",
    "support"
]

WORD_PATTERN = r"\b(" + "|".join(map(re.escape, TECH_WORDS + SOCIAL_WORDS)) + r")\b"


In [62]:

import torch

# ----------------------------------
# Token protection (ROBUST)
# ----------------------------------
def protect_tokens(text):
    protected = {}
    idx = 0
    patterns = [
    URL_PATTERN,       # URLs
    HANDLE_PATTERN,    # @handles
    WORD_PATTERN       # single words (tech + social)
   ]
   
    for pattern in patterns:
        def repl(match):
            nonlocal idx
            key = f"XQZPLCH{idx}XQZ"
            protected[key] = match.group()
            idx += 1
            return key

        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)

    return text, protected


def restore_tokens(text, protected):
    for k, v in protected.items():
        text = text.replace(k, v)
    return text


In [86]:
def hi_to_en(
    text,
    model,
    tokenizer,
    src_tag="eng_Latn",
    tgt_tag="hin_Deva",
    max_length=128
):
    # Protect non-linguistic / technical tokens
    safe_text, protected = protect_tokens(text)

    #  Add language tags
    tagged = f"{src_tag} {tgt_tag} {safe_text}"

    #  Tokenize
    inputs = tokenizer(tagged, return_tensors="pt").to(model.device)

    #  Generate (IndicTrans2-safe settings)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=4,       # REQUIRED
            use_cache=False   # REQUIRED
        )

    #  Decode English
    eng_text = tokenizer.decode(out[0], skip_special_tokens=True)

    # Restore protected tokens
    eng_text = restore_tokens(eng_text, protected)

    return eng_text

In [66]:
text= "@someUSER congratulations on you celebrating british kid singers sophia grace's and rosie's 1st anniversary of a visit of your show .  how"
eng_text = hi_to_en(text, model, tokenizer)  
eng_text

"eng_Latn @someUSER congratulations to you British Kid Singers Sophia Grace's and Rosie's 1st anniversary of a visit of your show."

# Evaluation metrics

# Finding  BLEU AND CHRF:

1. BLEU: BLEU checks how many n-grams from the candidate sentence also appear in the reference sentence.
2. CHRF: Instead of words, CHRF compares character n-grams.

In [67]:
model.device

device(type='cuda', index=0)

In [68]:
len(raw_dataset["test"])

250

In [73]:
def clean_lang_tag(text):
    return text.replace("eng_Latn", "").strip()  # -- for Stripping eng_Latn

In [74]:
n_samples = 10 # no of samples of test used for Evaluation

# Checking for gpu Device
try:
    model_device = next(model.parameters()).device
except StopIteration:
    model_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Model device:", model_device)

# Finding Prediction and Reference List

preds = []
refs = []

for i in range(min(n_samples, len(raw_dataset["test"]))):
    row = raw_dataset["test"][i]

    # handle multiple possible row formats
    if isinstance(row, dict) and "translation" in row:
        trans = row["translation"]
        # trans might be a dict or a string; handle both
        if isinstance(trans, dict):
            eng = trans.get("en") or trans.get("eng") or ""
            hi_ref = trans.get("hi") or trans.get("hin") or ""
        else:
            # sometimes translation is a string (rare) — treat as source
            eng = str(trans)
            hi_ref = ""
    elif isinstance(row, dict):
        # maybe keys are directly 'en' and 'hi'
        eng = row.get("en") or row.get("eng") or row.get("source") or ""
        hi_ref = row.get("hi") or row.get("hin") or row.get("target") or ""
    else:
        # fallback: row itself might be the translation dict-like
        try:
            eng = row["en"]
            hi_ref = row["hi"]
        except Exception:
            # last resort: stringify
            eng = str(row)
            hi_ref = ""

    eng = (eng or "").strip()
    hi_ref = (hi_ref or "").strip()
    refs.append(eng if eng else "")  # keep alignment

    # add required tags
    tagged = f"{SRC_TAG} {TGT_TAG} {eng}"

    # tokenize -> torch tensors -> move to model device
    tokenized = tokenizer(tagged, return_tensors="pt", padding=True, truncation=True, max_length=128)
    tokenized = {k: v.to(model_device) for k, v in tokenized.items()}

    # generate
    with torch.no_grad():
        out_ids = model.generate(**tokenized, max_length=128, num_beams=4, early_stopping=True)

    eng_text = hi_to_en(eng, model, tokenizer)    
    eng_text = clean_lang_tag(eng_text)
    preds.append(eng_text)

Model device: cuda:0


In [75]:
preds[0:4]

['@BeingSalmanKhan Brother you are making us emotional.................................................................................................',
 'really touching and amazing.',
 'xQZPLCH0XQZ their hand xQZPLCH1XQZ their luggage and it is up to them whatever they do.',
 '@trisha_naik Elf is distracted the great sages.']

In [76]:
len(preds)


10

In [77]:
#BLEU:
bleu = sacrebleu.corpus_bleu(preds, [refs])
#CHRF:
chrf = sacrebleu.corpus_chrf(preds, [refs])

print(f"\nEvaluated {len(preds)} samples")
print("BLEU:", bleu.score)
print("CHRF:", chrf.score)

for i in range(min(5, len(preds))):
    print(f"\n=== SAMPLE {i+1} ===")
    print("SRC :", (raw_dataset["test"][i].get("translation", raw_dataset["test"][i]).get("hing")
                    if isinstance(raw_dataset["test"][i], dict) and "translation" in raw_dataset["test"][i]
                    else (raw_dataset["test"][i].get("hing") if isinstance(raw_dataset["test"][i], dict) else str(raw_dataset["test"][i]))))
    print("PRED:", preds[i])
    print("REF :", refs[i])



Evaluated 10 samples
BLEU: 10.277064174708585
CHRF: 46.24087795676007

=== SAMPLE 1 ===
SRC : @BeingSalmanKhan bhai ab aap emotional kar rahe ho. Kuch logo ki wajah se sabko chhod ke chale jaoge !
PRED: @BeingSalmanKhan Brother you are making us emotional.................................................................................................
REF : @BeingSalmanKhan brother you are making us emotional. because of  few people you are just leaving all of us.

=== SAMPLE 2 ===
SRC : really touching and amazing .  .  bachpan ki yaad aa gayi .  .  .
PRED: really touching and amazing.
REF : really touching and amazing. reminded me of my childhood.

=== SAMPLE 3 ===
SRC : @0__1 unka haath, aur unka samaan jo chahe kare
PRED: xQZPLCH0XQZ their hand xQZPLCH1XQZ their luggage and it is up to them whatever they do.
REF : @0__1 their hand and their luggage, it is upto them whatever they do.

=== SAMPLE 4 ===
SRC : @trisha_naik Apsaraye to bade bade rishi muniyo ka dhyaan bhang kar chuki ha

# Finding BERTScore:

* BERTScore: Uses BERT (or RoBERTa, or mBERT) embeddings to compare every token in candidate with every token in reference.

In [78]:
!pip install bert-score
from bert_score import score

P, R, F1 = score(preds, refs, lang="eng")
print("BERTScore F1:", F1.mean().item())

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.0.



tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

  _torch_pytree._register_pytree_node(


model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]

BERTScore F1: 0.7452279329299927


# Finding BLEURT:
* BLEURT computes a similarity score using a fine-tuned BERT model that predicts human judgment of translation quality.

In [79]:
# Required Packages for Bleurt
!pip install evaluate
!pip install git+https://github.com/google-research/bleurt.git

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.6
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-ogs5pdr1
  Running command git clone --filter=blob:none --quiet https://github.com/google-research/bleurt.git /tmp/pip-req-build-ogs5pdr1
  Resolved https://github.com/google-research/bleurt.git to commit cebe7e6f996b40910cfaa520a63db47807e3bf5c
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: BLEURT
  Building wheel for BLEURT (setup.py) ... [?25l[?25hdone
  Created wheel for BLEURT: filename=BLEURT-0.0.2-py3-none-any.whl size=16456766 sha256=063e0fa3722c1c83729fdae4893a8

In [80]:
# Calculating Bleurt
import evaluate
bleurt = evaluate.load("bleurt")
results = bleurt.compute(predictions=preds, references=refs)
print(results)

Downloading builder script: 0.00B [00:00, ?B/s]

Using default checkpoint 'bleurt-base-128' for sequence maximum length 128. You can use a bigger model for better results with e.g.: evaluate.load('bleurt', config_name='bleurt-large-512').


Downloading data:   0%|          | 0.00/405M [00:00<?, ?B/s]

INFO:tensorflow:Reading checkpoint /root/.cache/huggingface/metrics/bleurt/default/downloads/extracted/887f2dc36c17f53c287f696681b8f7c947278407c1cf9f226662e16c8c0dc417/bleurt-base-128.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.


I0000 00:00:1766078287.710023      47 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6212 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
I0000 00:00:1766078287.710688      47 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13942 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5


INFO:tensorflow:BLEURT initialized.


INFO:tensorflow:BLEURT initialized.


{'scores': [-1.5466673374176025, -0.15730614960193634, -0.6651703119277954, 0.5648964047431946, 0.6526470184326172, -0.48419198393821716, -0.5224010348320007, -1.4905223846435547, -1.2163634300231934, -1.3041622638702393]}


# Finding COMET
* COMET predicts a score that strongly correlates with human judgment.

In [87]:
# Required package for Comet
!pip install -q unbabel-comet
from comet import download_model, load_from_checkpoint


# choose model variable 
translation_model = globals().get("translation_model", None) or globals().get("model", None)
if translation_model is None:
    raise ValueError("No translation model found. Load your model into `model` or `translation_model` first.")

# device: try to get model device (handles DeviceMap too)
try:
    model_device = next(translation_model.parameters()).device
except StopIteration:
    model_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

SRC_TAG = "hin_Deva"
TGT_TAG = "eng_Latn"

srcs = []
preds = []
refs = []

n = 10
for i in range(min(n, len(raw_dataset["test"]))):
    row = raw_dataset["test"][i]

    # robust extraction of source & reference
    if isinstance(row, dict) and "translation" in row:
        trans = row["translation"]
        if isinstance(trans, dict):
            ref = (trans.get("en") or trans.get("eng") or "").strip()
            src= (trans.get("hing") or trans.get("hing") or "").strip()
        else:
            ref = str(trans).strip()
            src= ""
    elif isinstance(row, dict):
        ref = (row.get("en") or row.get("eng") or row.get("source") or "").strip()
        src = (row.get("hing") or row.get("hing") or row.get("target") or "").strip()
    else:
        # fallback
        ref= str(row).strip()
        src= ""

    srcs.append(src)
    refs.append(ref)

    # add language tags required by IndicTrans2
    tagged = f"{SRC_TAG} {TGT_TAG} {src}"

    # tokenize -> PyTorch tensors -> move to model device
    tokenized = tokenizer(tagged,
                          return_tensors="pt",
                          truncation=True,
                          padding=True,
                          max_length=128)
    tokenized = {k: v.to(model_device) for k, v in tokenized.items()}

    # generate
    with torch.no_grad():
        out_ids = translation_model.generate(**tokenized, max_length=128, num_beams=4, early_stopping=True)

    eng_text = hi_to_en(eng, model, tokenizer)    
    eng_text = clean_lang_tag(eng_text)
    preds.append(eng_text)

# ---- COMET evaluation ----
model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

# prepare data list for COMET
data = [{"src": s, "mt": p, "ref": r} for s, p, r in zip(srcs, preds, refs)]
# note: comet_model.predict returns a dict; 'scores' contains numeric values
comet_out = comet_model.predict(data, batch_size=8)
comet_scores = comet_out["scores"] if isinstance(comet_out, dict) and "scores" in comet_out else comet_out

print("Samples evaluated:", len(preds))
print("Mean COMET score:", float(sum(comet_scores) / len(comet_scores)))

# preview
for i in range(5):
    print(f"\n--- SAMPLE {i+1} ---")
    print("SRC :", srcs[i])
    print("PRED:", preds[i])
    print("REF :", refs[i])


Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../root/.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/2760a223ac957f30acfb18c8aa649b01cf1d75f2/checkpoints/model.ckpt`
/usr/local/lib/python3.11/dist-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
INFO:pytorch_lightning.utilities.rank_zero:💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytor

Samples evaluated: 10
Mean COMET score: 0.310720382630825

--- SAMPLE 1 ---
SRC : @BeingSalmanKhan bhai ab aap emotional kar rahe ho. Kuch logo ki wajah se sabko chhod ke chale jaoge !
PRED: yaar bhi ko jaat hai xqzplch0xqz kya beer.............................................................................................
REF : @BeingSalmanKhan brother you are making us emotional. because of  few people you are just leaving all of us.

--- SAMPLE 2 ---
SRC : really touching and amazing .  .  bachpan ki yaad aa gayi .  .  .
PRED: :  bhai bhi bhi jiye toh bhi bhi bhi bhi bhi xqZPLCH0XQZ drink bhi bhi bhi bhi
REF : really touching and amazing. reminded me of my childhood.

--- SAMPLE 3 ---
SRC : @0__1 unka haath, aur unka samaan jo chahe kare
PRED: aap bhi jaat ke liye koi jaat ko xQZPLCH0XQZ drink beer.......................................................................................
REF : @0__1 their hand and their luggage, it is upto them whatever they do.

--- SAMPLE 4 ---
SRC :

In [91]:
# Saving the Model and tokenizer
model.save_pretrained("pt_model")
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/dict.SRC.json',
 'tokenizer/dict.TGT.json',
 'tokenizer/model.SRC',
 'tokenizer/model.TGT',
 'tokenizer/added_tokens.json')