# Section 1: Problem Definition & Objective

## a. Selected Project Track
The selected project track for this work is Artificial Intelligence Applications, with a focus on Natural Language Processing (NLP). The project specifically addresses the task of Neural Machine Translation, where deep learning techniques are used to automatically translate text from one language to another.

## b. Clear Problem Statement
A large amount of digital content and documentation is available in the English language, which creates difficulties for users who primarily understand Hindi. Manual translation of English documents into Hindi is time-consuming and not always feasible.

The problem addressed in this project is to build an automated system that can translate English text into Hindi accurately, using a machine learning model trained on parallel language data.

## c. Real-World Relevance and Motivation
English-to-Hindi translation is highly relevant in a multilingual country like India. Such a system can help users understand documents, educational material, instructions, and online content written in English.

The motivation behind this project is to reduce the language barrier and demonstrate how deep learning can be used to solve real-world language problems. The project also provides practical experience in building and training a custom translation model using the PyTorch framework.

# Section 2: Data Understanding & Preparation

## a. Dataset Source
The dataset used in this project is the publicly available IIT Bombay English–Hindi parallel corpus, obtained from Kaggle. It consists of sentence pairs in which each English sentence has a corresponding Hindi translation. This type of dataset is commonly used for training machine translation models and is suitable for supervised learning.

## b. Data Loading and Exploration
The dataset is loaded using the Pandas library. After loading, basic exploration is performed to understand the structure of the data, including the total number of sentence pairs and sample entries from the dataset.

Initial exploration helps verify that the dataset contains aligned English and Hindi sentences and allows inspection of sentence length and content before preprocessing.

In [3]:
import pandas as pd

# Load the dataset
df = pd.read_csv("data/hindi_english_parallel.csv")

# Display basic information
print("Total rows:", len(df))
df.head()

Total rows: 1561841


Unnamed: 0,hindi,english
0,अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें,Give your application an accessibility workout
1,एक्सेर्साइसर पहुंचनीयता अन्वेषक,Accerciser Accessibility Explorer
2,निचले पटल के लिए डिफोल्ट प्लग-इन खाका,The default plugin layout for the bottom panel
3,ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका,The default plugin layout for the top panel
4,उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से नि...,A list of plugins that are disabled by default


In [5]:
# Shuffle and sample the dataset
df = df.sample(n=10000, random_state=42)

print("Rows after sampling:", len(df))
df.head()

Rows after sampling: 10000


Unnamed: 0,hindi,english
957248,बडे पैमाने पर सुनामी से प्रभावीत जापान में 4 द...,"4 days after the massive tsunami struck Japan,..."
1072034,वर्ग का पूर्णा क्या था?,What was completing the square?
1195844,मैं अपना काम कर चुका हूँ।,I have already done my work.
1123517,राष्ट्रीय मनः स्वास्थ्य कार्यक्रम,National Mental Health Programme
933515,क्रियावली,menu


In [7]:
sentence_pairs = []

for _, row in df.iterrows():
    english = str(row['english']).strip()
    hindi = str(row['hindi']).strip()
    
    if english and hindi:
        sentence_pairs.append((english, hindi))

print("Total sentence pairs:", len(sentence_pairs))
print(sentence_pairs[0])

Total sentence pairs: 10000
('4 days after the massive tsunami struck Japan, hopes of finding anyone still alive were fading.', 'बडे पैमाने पर सुनामी से प्रभावीत जापान में 4 दिनो बाद कोई अभी तक जिंदा होने की आशाएँ लुप्त हो रही थी।')


## c. Cleaning, Preprocessing, and Feature Engineering
Several preprocessing steps are applied to prepare the data for model training:
- All English text is converted to lowercase.
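- Characters outside the target script and a small punctuation set (? . ! ,) are replaced with spaces; this also strips ASCII digits, as the cleaned sample shows.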
- Hindi text is normalized using Unicode normalization to ensure consistency.
- Extra spaces are removed from both English and Hindi sentences.
- Special tokens such as `<sos>` (start of sentence) and `<eos>` (end of sentence) are added.
- Vocabulary dictionaries are created for both English and Hindi languages by assigning unique indices to each word.
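To illustrate why Unicode normalization helps with consistency in Hindi text, the sketch below compares two encodings of the same visible letter. The helper name `normalize_hindi` and the sample characters are illustrative and not taken from the notebook; only the NFKC call mirrors the cleaning step.

```python
import unicodedata

def normalize_hindi(text: str) -> str:
    """Apply NFKC normalization, the same form used in the cleaning step."""
    return unicodedata.normalize("NFKC", text)

# Two encodings of the same letter: one precomposed code point versus
# the base letter KA followed by a combining nukta.
precomposed = "\u0958"
decomposed = "\u0915\u093c"

same_before = precomposed == decomposed
same_after = normalize_hindi(precomposed) == normalize_hindi(decomposed)
print(same_before, same_after)
```

Without normalization the two strings would be treated as different words and receive separate vocabulary indices.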

## d. Handling Missing Values or Noise
The dataset does not contain missing values in the sentence pairs. However, noisy data such as extremely short or invalid sentences is filtered out during preprocessing: pairs in which either side has fewer than two words after cleaning are removed, which reduces the 10,000 sampled pairs to 7,801.
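The statement about missing values can be checked before any `str()` conversion. The sketch below assumes that missing cells arrive as float NaN, which is how pandas represents them when reading a CSV; the helper `is_missing` and the sample rows are hypothetical.

```python
import math

def is_missing(value) -> bool:
    """Return True for None, float NaN, or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()

rows = [
    ("Hello there", "नमस्ते"),
    (float("nan"), "कुछ पाठ"),   # missing English side
    ("Some text", "   "),        # blank Hindi side
]

# str(float("nan")) is the non-empty string "nan", so a plain truthiness
# check after str() would keep the second row.
kept = [(en, hi) for en, hi in rows if not (is_missing(en) or is_missing(hi))]
print(len(kept))
```

Running a check of this kind on the raw columns would confirm whether any "nan" strings can slip into the sentence pairs.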

In [10]:
import re
import unicodedata

def clean_english(text):
    text = text.lower()
    text = re.sub(r"[^a-z?.!,]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def clean_hindi(text):
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[^\u0900-\u097F?.!,]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

cleaned_pairs = []

for en, hi in sentence_pairs:
    en_clean = clean_english(en)
    hi_clean = clean_hindi(hi)
    
    if len(en_clean.split()) > 1 and len(hi_clean.split()) > 1:
        cleaned_pairs.append((en_clean, hi_clean))

print("After cleaning:", len(cleaned_pairs))
print(cleaned_pairs[0])

After cleaning: 7801
('days after the massive tsunami struck japan, hopes of finding anyone still alive were fading.', 'बडे पैमाने पर सुनामी से प्रभावीत जापान में दिनो बाद कोई अभी तक जिंदा होने की आशाएँ लुप्त हो रही थी।')


In [12]:
MAX_LEN = 30

filtered_pairs = []
for en, hi in cleaned_pairs:
    if len(en.split()) <= MAX_LEN and len(hi.split()) <= MAX_LEN:
        filtered_pairs.append((en, hi))

print("Before filtering:", len(cleaned_pairs))
print("After filtering:", len(filtered_pairs))


Before filtering: 7801
After filtering: 6557


In [14]:
# Reduce dataset size for training
filtered_pairs = filtered_pairs[:4500]

print("Pairs used for training:", len(filtered_pairs))


Pairs used for training: 4500


In [16]:
train_pairs = filtered_pairs[:3500]
test_pairs = filtered_pairs[3500:4500]

print("Train:", len(train_pairs))
print("Test:", len(test_pairs))

Train: 3500
Test: 1000


In [18]:
# Special tokens
PAD_TOKEN = "<pad>"
SOS_TOKEN = "<sos>"
EOS_TOKEN = "<eos>"
UNK_TOKEN = "<unk>"

In [20]:
class Vocabulary:
    def __init__(self):
        self.word2idx = {
            PAD_TOKEN: 0,
            SOS_TOKEN: 1,
            EOS_TOKEN: 2,
            UNK_TOKEN: 3
        }
        self.idx2word = {idx: word for word, idx in self.word2idx.items()}
        
        # Special tokens start with a count of zero; add_sentence increments them
        self.word_count = {
            PAD_TOKEN: 0,
            SOS_TOKEN: 0,
            EOS_TOKEN: 0,
            UNK_TOKEN: 0
        }

    def add_sentence(self, sentence):
        for word in sentence.split():
            if word not in self.word2idx:
                idx = len(self.word2idx)
                self.word2idx[word] = idx
                self.idx2word[idx] = word
                self.word_count[word] = 1
            else:
                self.word_count[word] += 1

In [22]:
tokenized_pairs = []

for en, hi in train_pairs:
    en_sentence = f"{SOS_TOKEN} {en} {EOS_TOKEN}"
    hi_sentence = f"{SOS_TOKEN} {hi} {EOS_TOKEN}"
    tokenized_pairs.append((en_sentence, hi_sentence))

print(tokenized_pairs[0])

('<sos> days after the massive tsunami struck japan, hopes of finding anyone still alive were fading. <eos>', '<sos> बडे पैमाने पर सुनामी से प्रभावीत जापान में दिनो बाद कोई अभी तक जिंदा होने की आशाएँ लुप्त हो रही थी। <eos>')


In [24]:
en_vocab = Vocabulary()
hi_vocab = Vocabulary()

for en_sentence, hi_sentence in tokenized_pairs:
    en_vocab.add_sentence(en_sentence)
    hi_vocab.add_sentence(hi_sentence)

print("English vocab size:", len(en_vocab.word2idx))
print("Hindi vocab size:", len(hi_vocab.word2idx))

English vocab size: 9718
Hindi vocab size: 9679


In [26]:
def sentence_to_indices(sentence, vocab):
    return [
        vocab.word2idx.get(word, vocab.word2idx[UNK_TOKEN])
        for word in sentence.split()
    ]

numeric_pairs = [
    (sentence_to_indices(en, en_vocab),
     sentence_to_indices(hi, hi_vocab))
    for en, hi in tokenized_pairs
]

print(numeric_pairs[0])

([1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 2], [1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2])


In [28]:
MAX_LEN = 30

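def pad_sequence(seq, max_len, pad_idx=0):
    # Right-pad short sequences with pad_idx; longer ones are truncated,
    # which can cut off the trailing <eos> token.
    if len(seq) < max_len:
        return seq + [pad_idx] * (max_len - len(seq))
    else:
        return seq[:max_len]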

padded_pairs = [
    (pad_sequence(en, MAX_LEN),
     pad_sequence(hi, MAX_LEN))
    for en, hi in numeric_pairs
]

print(padded_pairs[0])


([1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2, 0, 0, 0, 0, 0, 0, 0])


# Section 3: Model / System Design

## a. AI Technique Used
This project uses Deep Learning–based Natural Language Processing (NLP) techniques. Specifically, a Neural Machine Translation (NMT) approach is implemented using a Sequence-to-Sequence (Seq2Seq) model with an Attention mechanism.

The model is trained from scratch using the PyTorch deep learning framework. No pre-trained language models or large language models (LLMs) are used in this project.


## b. Model Architecture

### Encoder

- The encoder is implemented using an LSTM network.
- It takes tokenized English sentences as input.
- Each word is converted into an embedding vector.
- The encoder outputs hidden states that capture the meaning of the input sentence.

### Decoder

- The decoder is also implemented using an LSTM network.
- It generates the Hindi translation word by word.
- At each time step, the decoder uses attention to focus on relevant encoder states.
- The output is a probability distribution over the Hindi vocabulary.

### Attention Mechanism

- The attention mechanism allows the model to align input and output words.
- Instead of relying on a single fixed context vector, attention dynamically weighs encoder outputs.
- This improves translation quality, especially for longer sentences.
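The weighting described above can be written out as equations. This is a sketch of additive attention that appears to match the `Attention` module defined in Section 4; the symbol names are notation chosen here, not taken from the notebook. $h_{t-1}$ is the decoder hidden state from the previous step (last layer), $o_i$ is the encoder output at source position $i$, $T$ is the source length, $W$ and $b$ correspond to the `attn` linear layer, and $v$ to the bias-free `v` layer.

```latex
\begin{aligned}
e_{t,i} &= v^{\top} \tanh\!\big(W\,[\,h_{t-1};\, o_i\,] + b\big) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})} \\
c_t &= \sum_{i=1}^{T} \alpha_{t,i}\, o_i
\end{aligned}
```

The context vector $c_t$ is concatenated with the embedded input token before it enters the decoder LSTM, and again with the LSTM output before the final projection onto the Hindi vocabulary.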


## c. Justification of Design Choices 

A Seq2Seq model with attention was chosen because it is well-suited for language translation tasks where input and output sequences have different lengths. The attention mechanism improves translation quality by helping the decoder focus on important words in the input sentence.

LSTM networks were selected due to their ability to handle sequential data and capture contextual dependencies in language. PyTorch was chosen as the implementation framework because of its flexibility, ease of debugging, and strong support for deep learning research.

To ensure feasible training on a CPU-based system, sentence length and dataset size were constrained. These design choices allowed successful model training while still demonstrating the complete working of a neural machine translation system.


# Section 4: Core Implementation

This section covers the implementation of the Neural Machine Translation system, including
the encoder, decoder, attention mechanism, and training pipeline.


In [34]:
import torch
import torch.nn as nn
import torch.optim as optim

In [35]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cpu


In [38]:
INPUT_DIM = len(en_vocab.word2idx)    # English vocabulary size
OUTPUT_DIM = len(hi_vocab.word2idx)   # Hindi vocabulary size

EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_LAYERS = 1
DROPOUT = 0.3

In [40]:
class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, num_layers, dropout):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )

    def forward(self, src):
        # src shape: (batch_size, seq_len)
        embedded = self.embedding(src)
        # embedded shape: (batch_size, seq_len, embedding_dim)

        outputs, (hidden, cell) = self.lstm(embedded)

        return outputs, hidden, cell


In [42]:
class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden shape: (num_layers, batch_size, hidden_dim)
        # encoder_outputs shape: (batch_size, seq_len, hidden_dim)

        batch_size = encoder_outputs.size(0)
        seq_len = encoder_outputs.size(1)

        # Use the last layer's hidden state
        hidden = hidden[-1].unsqueeze(1).repeat(1, seq_len, 1)
        # hidden shape: (batch_size, seq_len, hidden_dim)

        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        # energy shape: (batch_size, seq_len, hidden_dim)

        attention = self.v(energy).squeeze(2)
        # attention shape: (batch_size, seq_len)

        return torch.softmax(attention, dim=1)


In [44]:
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, num_layers, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention

        self.embedding = nn.Embedding(output_dim, embedding_dim)

        self.lstm = nn.LSTM(
            embedding_dim + hidden_dim,
            hidden_dim,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )

        self.fc_out = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell, encoder_outputs):
        # input shape: (batch_size)
        input = input.unsqueeze(1)
        # input shape: (batch_size, 1)

        embedded = self.dropout(self.embedding(input))
        # embedded shape: (batch_size, 1, embedding_dim)

        attention_weights = self.attention(hidden, encoder_outputs)
        # attention_weights shape: (batch_size, seq_len)

        attention_weights = attention_weights.unsqueeze(1)
        # shape: (batch_size, 1, seq_len)

        context = torch.bmm(attention_weights, encoder_outputs)
        # context shape: (batch_size, 1, hidden_dim)

        lstm_input = torch.cat((embedded, context), dim=2)
        # shape: (batch_size, 1, embedding_dim + hidden_dim)

        output, (hidden, cell) = self.lstm(lstm_input, (hidden, cell))
        # output shape: (batch_size, 1, hidden_dim)

        output = output.squeeze(1)
        context = context.squeeze(1)

        prediction = self.fc_out(torch.cat((output, context), dim=1))
        # prediction shape: (batch_size, output_dim)

        return prediction, hidden, cell


## a. Model Training

The model is trained using a Seq2Seq architecture with an attention mechanism on English–Hindi sentence pairs. Training is performed using the Adam optimizer and Cross-Entropy loss. After training, the model is switched to evaluation mode to generate Hindi translations for unseen English sentences.

## b. Prompt Engineering

Prompt engineering is **not applicable** to this project, as the system does not use large language models (LLMs) or prompt-based inference. Instead, the model is trained from scratch using supervised learning on a parallel corpus.

In [47]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src shape: (batch_size, src_len)
        # trg shape: (batch_size, trg_len)

        batch_size = src.size(0)
        trg_len = trg.size(1)
        trg_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden, cell = self.encoder(src)

        input = trg[:, 0]  # <sos>

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(
                input, hidden, cell, encoder_outputs
            )

            outputs[:, t, :] = output

            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1)

            input = trg[:, t] if teacher_force else top1

        return outputs


In [49]:
# Initialize attention
attention = Attention(HIDDEN_DIM)

# Initialize encoder
encoder = Encoder(
    input_dim=INPUT_DIM,
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    num_layers=NUM_LAYERS,
    dropout=DROPOUT
)

# Initialize decoder
decoder = Decoder(
    output_dim=OUTPUT_DIM,
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    num_layers=NUM_LAYERS,
    dropout=DROPOUT,
    attention=attention
)

# Initialize Seq2Seq model
model = Seq2Seq(encoder, decoder, device).to(device)

# Define loss function (ignore padding)
criterion = nn.CrossEntropyLoss(ignore_index=0)

# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

print("Model, loss, and optimizer initialized successfully")


Model, loss, and optimizer initialized successfully


### Dataset and DataLoader Preparation


In [52]:
def sentence_to_indices(sentence, vocab):
    return [vocab.word2idx.get(word, vocab.word2idx["<unk>"]) for word in sentence.split()]


In [54]:
indexed_pairs = []

for en, hi in tokenized_pairs:
    en_ids = sentence_to_indices(en, en_vocab)
    hi_ids = sentence_to_indices(hi, hi_vocab)
    indexed_pairs.append((en_ids, hi_ids))


In [56]:
# Reduce dataset size for training due to computational constraints
indexed_pairs = indexed_pairs[:3000]


In [58]:
# Note: this import shadows the list-based pad_sequence helper defined earlier
from torch.nn.utils.rnn import pad_sequence

def pad_pairs(pairs):
    src_seqs = [torch.tensor(pair[0]) for pair in pairs]
    trg_seqs = [torch.tensor(pair[1]) for pair in pairs]

    src_padded = pad_sequence(src_seqs, batch_first=True, padding_value=0)
    trg_padded = pad_sequence(trg_seqs, batch_first=True, padding_value=0)

    return src_padded, trg_padded


In [60]:
from torch.utils.data import TensorDataset, DataLoader

src_data, trg_data = pad_pairs(indexed_pairs)

dataset = TensorDataset(src_data, trg_data)

dataloader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True
)

print("DataLoader created successfully")


DataLoader created successfully


## c. Translation Pipeline

The translation pipeline follows a structured sequence of steps:
- Input English text is tokenized and converted into numerical indices.
- The encoder processes the input sequence to generate contextual representations.
- The attention mechanism computes alignment between input and output tokens.
- The decoder predicts the next Hindi word at each time step.
- Translation continues until an end-of-sentence token is generated.
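The decoding steps listed above can be sketched without the neural network. In this minimal example the decoder is replaced by a hypothetical lookup table (`toy_table`), and `greedy_decode` and `toy_step` are illustrative names; it shows only the control flow of greedy decoding, not the notebook's model.

```python
from typing import Callable, Dict, List

SOS, EOS = "<sos>", "<eos>"

def greedy_decode(
    next_token_fn: Callable[[List[str]], Dict[str, float]],
    max_len: int = 30,
) -> List[str]:
    """Pick the highest-scoring token at each step until <eos> or max_len."""
    output = [SOS]
    for _ in range(max_len):
        scores = next_token_fn(output)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        output.append(best)
    return output[1:]  # drop <sos>

# Hypothetical stand-in for the decoder: a fixed table keyed by the last token.
toy_table = {
    SOS: {"मैं": 0.7, "हम": 0.3},
    "मैं": {"छात्र": 0.6, EOS: 0.4},
    "छात्र": {"हूँ": 0.8, EOS: 0.2},
    "हूँ": {EOS: 0.9, "मैं": 0.1},
}

def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Return next-token scores that depend only on the last generated token."""
    return toy_table[prefix[-1]]

translation = greedy_decode(toy_step)
print(" ".join(translation))
```

In the real system the scores come from the decoder's output layer, and the `max_len` cap guards against sequences that never emit the end-of-sentence token.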

In [63]:
def train_model(model, dataloader, optimizer, criterion, device):
    model.train()
    epoch_loss = 0

    for src, trg in dataloader:
        src = src.to(device)
        trg = trg.to(device)

        optimizer.zero_grad()

        output = model(src, trg)
        # output shape: (batch_size, trg_len, OUTPUT_DIM)

        output_dim = output.shape[-1]

        output = output[:, 1:].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(dataloader)


In [65]:
EPOCHS = 20

for epoch in range(EPOCHS):
    loss = train_model(model, dataloader, optimizer, criterion, device)
    print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {loss:.4f}")
    

Epoch 1/20, Loss: 7.4306
Epoch 2/20, Loss: 6.7865
Epoch 3/20, Loss: 6.4642
Epoch 4/20, Loss: 6.0303
Epoch 5/20, Loss: 5.3755
Epoch 6/20, Loss: 4.5387
Epoch 7/20, Loss: 3.7104
Epoch 8/20, Loss: 3.1050
Epoch 9/20, Loss: 2.7798
Epoch 10/20, Loss: 2.5133
Epoch 11/20, Loss: 2.2736
Epoch 12/20, Loss: 2.0455
Epoch 13/20, Loss: 1.8587
Epoch 14/20, Loss: 1.6609
Epoch 15/20, Loss: 1.4923
Epoch 16/20, Loss: 1.3426
Epoch 17/20, Loss: 1.2064
Epoch 18/20, Loss: 1.1034
Epoch 19/20, Loss: 0.9698
Epoch 20/20, Loss: 0.8828


# Section 5: Evaluation and Analysis

## a. Metrics Used
The model is evaluated using cross-entropy loss during training, which measures how much probability the model assigns to each target Hindi token. The steady decrease in loss across epochs shows that the model is fitting the training pairs. Because the loss is computed on the training data only, it does not by itself measure translation quality on unseen text.
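Cross-entropy values are easier to interpret when converted to perplexity. The sketch below assumes the reported epoch losses are mean per-token cross-entropies in nats, which is an approximation because the notebook averages per-batch means; the function name `perplexity` is illustrative.

```python
import math

def perplexity(cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy)

first_epoch_loss = 7.4306   # reported loss after epoch 1
last_epoch_loss = 0.8828    # reported loss after epoch 20

ppl_first = perplexity(first_epoch_loss)
ppl_last = perplexity(last_epoch_loss)
print(f"{ppl_first:.0f} -> {ppl_last:.2f}")
```

A perplexity near the vocabulary size means the model is close to guessing uniformly; a value near 1 means it is almost certain of each training token.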

Along with loss, a qualitative evaluation is performed by testing the trained model on a few sample English sentences and manually observing the translated Hindi output. This helps understand how well the model performs in real translation scenarios, especially for sentence structure and word selection.
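Manual inspection could be complemented by an automatic overlap score. The sketch below is a simplified, sentence-level BLEU-style measure using unigrams and bigrams only; it is not the standard BLEU implementation. The function names and the reference translation are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference (clipped)."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def simple_bleu(candidate: List[str], reference: List[str], max_n: int = 2) -> float:
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)

# Illustrative reference for "i am a student"; not taken from the dataset.
reference = "मैं एक छात्र हूँ।".split()
perfect = simple_bleu(reference, reference)
partial = simple_bleu("मैं हूँ।".split(), reference)
print(perfect, partial)
```

Scoring the held-out test pairs with a measure of this kind would give a number that can be compared across training runs.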

In [69]:
model.eval()
print("Model set to evaluation mode")


Model set to evaluation mode


In [71]:
def translate_sentence(sentence, model, en_vocab, hi_vocab, device, max_len=30):
    model.eval()

    # tokenize source sentence
    tokens = sentence.lower().split()
    tokens = ["<sos>"] + tokens + ["<eos>"]

    src_indices = [
        en_vocab.word2idx.get(token, en_vocab.word2idx["<unk>"])
        for token in tokens
    ]

    src_tensor = torch.tensor(src_indices).unsqueeze(0).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)

    # START target sequence
    trg_indices = [hi_vocab.word2idx["<sos>"]]

    for _ in range(max_len):
        trg_tensor = torch.tensor([trg_indices[-1]]).to(device)

        with torch.no_grad():
            output, hidden, cell = model.decoder(
                trg_tensor, hidden, cell, encoder_outputs
            )

        pred_token = output.argmax(1).item()
        trg_indices.append(pred_token)

        if pred_token == hi_vocab.word2idx["<eos>"]:
            break

    # convert indices to words (remove <sos>, stop at <eos>)
    trg_tokens = []
    for idx in trg_indices:
        token = hi_vocab.idx2word.get(idx, "<unk>")
        if token == "<eos>":
            break
        trg_tokens.append(token)

    return trg_tokens[1:]  # remove <sos>


## b. Sample Outputs
After training, the model is tested on unseen English sentences. The model generates Hindi translations word-by-word using the trained Seq2Seq + Attention architecture.

The observed results show that the model is able to produce Hindi tokens and short phrases, and in some cases it generates meaningful Hindi words. However, the translations are often not fully accurate, and the output may contain repeated words, incomplete sentence meaning, or incorrect grammar.

This demonstrates that the model has learned basic word associations and sentence patterns, but the translation quality is still limited compared to a real-world production translator.

In [78]:
test_sentences = [
    "i am a student",
    "he is a boy",
    "she is my friend",
    "she like studying maths",
    "we are learning"
]

for sentence in test_sentences:
    translation = translate_sentence(sentence, model, en_vocab, hi_vocab, device, max_len=30)
    translation = [w for w in translation if w not in ["<sos>", "<eos>", "<pad>"]]

    print(f"English: {sentence}")
    print(f"Hindi: {' '.join(translation)}")
    print("-" * 40)


English: i am a student
Hindi: मैं हूँ।
----------------------------------------
English: he is a boy
Hindi: है।
----------------------------------------
English: she is my friend
Hindi: है. है।
----------------------------------------
English: she like studying maths
Hindi: जैसे ली।
----------------------------------------
English: we are learning
Hindi: हम चाहते हैं
----------------------------------------


## Text File Translation (TXT Input)

This part of the project allows translating a plain text (.txt) file from English to Hindi.
The file is read, its content is extracted as text, and the trained model translates it line by line.
The translated output is then saved into a new text file for easy viewing.

In [81]:
import re

def translate_text_block(text, model, en_vocab, hi_vocab, device, max_len=30):
    """
    Translates a block of English text into Hindi using the trained model.
    Works line-by-line to keep it simple and stable.
    """
    lines = text.splitlines()
    translated_lines = []

    for line in lines:
        line = line.strip()
        if not line:
            translated_lines.append("")
            continue

        # basic cleanup to make model behave slightly better
        line_clean = re.sub(r"[^a-zA-Z0-9\s]", "", line).lower().strip()

        # skip lines that are empty after cleanup
        if len(line_clean.split()) == 0:
            translated_lines.append("")
            continue

        translated_tokens = translate_sentence(line_clean, model, en_vocab, hi_vocab, device, max_len=max_len)
        translated_lines.append(" ".join(translated_tokens))

    return "\n".join(translated_lines)

In [83]:
def translate_txt_file(input_path, output_path, model, en_vocab, hi_vocab, device, max_len=30):
    with open(input_path, "r", encoding="utf-8") as f:
        text = f.read()

    translated_text = translate_text_block(text, model, en_vocab, hi_vocab, device, max_len=max_len)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(translated_text)

    print("TXT Translation Completed ✅")
    print("Saved to:", output_path)


In [87]:
sample_txt_path = "sample_input.txt"
with open(sample_txt_path, "w", encoding="utf-8") as f:
    f.write("I am going to school.\nHe is my friend.\nThis is a good day.")
print("Created:", sample_txt_path)


Created: sample_input.txt


In [89]:
translate_txt_file(
    input_path="sample_input.txt",
    output_path="translated_output.txt",
    model=model,
    en_vocab=en_vocab,
    hi_vocab=hi_vocab,
    device=device,
    max_len=30
)


TXT Translation Completed ✅
Saved to: translated_output.txt


In [91]:
with open("translated_output.txt", "r", encoding="utf-8") as f:
    print(f.read())


मुझे संदेश रहा है।
है।
यह भी होता है।


## PDF Translation (PDF Input)

This section adds support for translating PDF documents.
Using the PyMuPDF (fitz) library, the text content is extracted from the PDF; the number of pages read can be capped with the max_pages argument.
The extracted English text is then passed to the translation model, and the final Hindi translation is saved into an output .txt file.

In [94]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path, max_pages=5):
    doc = fitz.open(pdf_path)
    text_parts = []
    pages_to_read = min(len(doc), max_pages)

    for i in range(pages_to_read):
        page = doc[i]
        text_parts.append(page.get_text())

    doc.close()
    return "\n".join(text_parts)



In [96]:
def translate_pdf_file(pdf_path, output_txt_path, model, en_vocab, hi_vocab, device, max_len=30, max_pages=5):
    
    extracted_text = extract_text_from_pdf(pdf_path, max_pages=max_pages)

    translated_text = translate_text_block(extracted_text, model, en_vocab, hi_vocab, device, max_len=max_len)

    with open(output_txt_path, "w", encoding="utf-8") as f:
        f.write(translated_text)

    print("PDF Translation Completed ✅")
    print("Saved to:", output_txt_path)


In [98]:
translate_pdf_file(
    pdf_path="english.pdf",
    output_txt_path="translated_pdf_output.txt",
    model=model,
    en_vocab=en_vocab,
    hi_vocab=hi_vocab,
    device=device,
    max_len=30,
    max_pages=5
)


PDF Translation Completed ✅
Saved to: translated_pdf_output.txt


## c. Performance Analysis and Limitations

Training loss falls from 7.43 in the first epoch to 0.88 in the last, which confirms that the model fits the training data. The sample outputs, however, show that this does not carry over to accurate translations, and the final translation quality is affected by several limitations:
- The model is trained from scratch without using large pre-trained transformer models.
- The dataset size is reduced for faster training, which limits language exposure.
- The system is trained on CPU, so training complexity and epochs are restricted.
- The translations may fail on long or complex sentences due to limited context understanding.
- The output sometimes contains repetition or incorrect word ordering, showing that the model still struggles with fluency.

Overall, the project successfully demonstrates a working academic prototype of an English-to-Hindi translation system using deep learning, but further improvements are required for high-quality translation output.

# Section 6: Ethical Considerations & Responsible AI


## a. Bias and Fairness Considerations

- The translation model may reflect biases present in the training dataset, such as gender stereotypes or incorrect assumptions in certain sentences.
- Some words or phrases may be translated inaccurately depending on context, which can lead to unfair or misleading outputs.
- The model may perform better on common sentence patterns but worse on less frequent topics, which creates unequal performance across different user inputs.


## b. Dataset Limitations

- The dataset contains limited vocabulary coverage and does not represent all types of English and Hindi language usage.
- Some sentence pairs may contain noise, informal text, or inconsistent translations, which impacts model learning.
- The dataset may not include domain-specific language (medical, legal, technical), so translations in such contexts may be incorrect.
- Since a reduced subset of the dataset was used for training due to computational constraints, the model may not generalize well to unseen sentences.

## c. Responsible Use of AI Tools

- The translation system should be used as an academic prototype and not as a final trusted translator.
- The model output should be verified by humans before using it in important contexts such as healthcare, legal documents, or official communication.
- Users should be aware that the model can produce incorrect or misleading translations and should not blindly rely on it.
- Responsible development includes clearly mentioning limitations and avoiding false claims about accuracy or performance.

# Section 7: Conclusion and Future Scope


## a. Summary of Results

- An English-to-Hindi Neural Machine Translation system was successfully implemented using a Seq2Seq model with Attention in PyTorch.
- The complete workflow was achieved, including dataset loading, preprocessing, vocabulary creation, model training, inference, and result analysis.
- Training loss decreased consistently across epochs, indicating that the model fit the training data; this alone does not establish translation quality on unseen sentences.
- The system supports translation of single input sentences and also provides utilities for translating TXT files and PDF documents by extracting and processing the text.
- Although the translations are not always fully accurate or fluent, the project demonstrates a complete end-to-end translation pipeline and validates the practical implementation of a neural machine translation system.

## b. Possible Improvements and Extensions

- Train the model on a larger and more diverse English–Hindi dataset to improve translation accuracy and vocabulary coverage.
- Use a proper train/validation/test split and apply evaluation metrics more consistently for reliable performance measurement.
- Replace the Seq2Seq model with a Transformer-based architecture to achieve better long-range context understanding and more fluent translations.
- Improve decoding by using beam search instead of greedy decoding to reduce repetition and generate more meaningful output sentences.
- Optimize training using GPU acceleration, which will allow larger models, more epochs, and faster experimentation.
- Enhance file translation by improving PDF text extraction for scanned or complex PDFs and adding formatting-aware translation for better readability.
- Develop a simple GUI or web application where users can upload TXT/PDF files and download translated output easily.
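
The train/validation/test split suggested above could be sketched as follows. The function name `split_pairs`, the 10% fractions, and the dummy data are assumptions for illustration; the notebook itself slices the list in order without shuffling.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]

def split_pairs(
    pairs: Sequence[Pair],
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Dummy stand-in for the filtered sentence pairs.
dummy_pairs = [(f"en {i}", f"hi {i}") for i in range(4500)]
train, val, test = split_pairs(dummy_pairs)
print(len(train), len(val), len(test))
```

The validation set would be used to monitor loss after each epoch, while the test set stays untouched until the final evaluation.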
