<a href="https://colab.research.google.com/github/shashwatanand1801/Hindi_English-Translator/blob/main/Hindi_English.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Please make sure you have run the installation cell below at least once.**

In [1]:
%pip install torch transformers pandas sacremoses sentencepiece
%pip install datasets==1.18.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Importing the dataset**

In [2]:
from datasets import load_dataset

In [3]:
dataset = load_dataset("cfilt/iitb-english-hindi")




In [4]:
dataset

DatasetDict({
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})

**The `DatasetDict` contains all three splits: train, validation and test.**
Only part of the training split (the first **100,000 rows**) is used here because of RAM constraints.

In [5]:
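# Slice the rows first, then select the column, so that only the first
# 100,000 translation pairs are materialised in memory.
ds = dataset["train"][:100000]["translation"]

# en -> English
# hi -> Hindi

# Write one tab-separated "Hindi<TAB>English" pair per line.
# The context manager closes the file even if a write fails.
with open("file.csv", "w", encoding="utf8") as file:
    for translation_pair in ds:
        source_sentence = translation_pair["hi"]
        target_sentence = translation_pair["en"]
        file.write(source_sentence.strip("\n") + "\t" + target_sentence.strip("\n") + "\n")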
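
Writing the pairs by hand with `"\t"` as the separator works only as long as no sentence contains a tab or an unbalanced double quote; such rows are what `on_bad_lines='skip'` would later drop or misparse. A minimal sketch of a safer round trip using the standard `csv` module is shown below. The helper names `write_pairs` and `read_pairs` and the sample sentences are hypothetical and not part of the original notebook.

```python
import csv
import io


def write_pairs(pairs, handle):
    """Write (source, target) pairs as tab-separated rows, quoting where needed."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for source, target in pairs:
        writer.writerow([source.strip(), target.strip()])


def read_pairs(handle):
    """Read the rows back as (source, target) tuples."""
    return [tuple(row) for row in csv.reader(handle, delimiter="\t")]


# Hypothetical sample pairs; the second one contains a tab and double quotes.
sample_pairs = [
    ("नमस्ते", "Hello"),
    ("वह\tबोला", 'He said "hi"'),
]

buffer = io.StringIO()  # stands in for file.csv so the sketch needs no disk access
write_pairs(sample_pairs, buffer)
buffer.seek(0)
round_tripped = read_pairs(buffer)
```

Because the writer quotes any field that contains the delimiter or a quote character, the rows read back should match what was written instead of being split into extra columns.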


In [6]:
import pandas as pd

df = pd.read_csv('./file.csv', 
                 sep='\t', names=['hindi', 'english'], on_bad_lines='skip')
df = df.astype(str)


**Preprocessing datasets**

In [7]:
import string

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

df['hindi'] = df['hindi'].apply(preprocess)
df['english'] = df['english'].apply(preprocess)
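
Note that `string.punctuation` covers ASCII punctuation only, so the Devanagari danda (`।`, U+0964) survives this step, and `lower()` has no effect on Hindi text. The sketch below illustrates the difference; `EXTENDED_TABLE` is an assumed extension for illustration, not something the original notebook uses.

```python
import string

# Same translation table as the notebook's preprocess(): ASCII punctuation only.
ASCII_TABLE = str.maketrans("", "", string.punctuation)

# Assumption: additionally strip the danda (U+0964) and double danda (U+0965).
EXTENDED_TABLE = str.maketrans("", "", string.punctuation + "\u0964\u0965")


def preprocess(text, table=ASCII_TABLE):
    """Lowercase the text and delete every character listed in `table`."""
    return text.lower().translate(table)


english = preprocess("Hello, World!")
hindi_ascii_only = preprocess("यह घर है\u0964")
hindi_extended = preprocess("यह घर है\u0964", EXTENDED_TABLE)
```

With the ASCII-only table the Hindi sentence keeps its trailing danda; with the extended table it is removed.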


In [8]:
from sklearn.model_selection import train_test_split

train_texts, val_texts = train_test_split(df[['hindi', 'english']].values, test_size=0.2)
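
`test_size=0.2` holds out 20% of the rows for validation. Since no seed is passed, the split differs from run to run; `train_test_split` accepts a `random_state` argument for reproducibility. The idea can be sketched with the standard library alone. This is an illustration of a seeded hold-out split, not scikit-learn's implementation, and `split_pairs` is a hypothetical helper.

```python
import random


def split_pairs(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed and hold out a `test_size` fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * test_size)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical placeholder pairs standing in for (hindi, english) rows.
pairs = [(f"hi_{i}", f"en_{i}") for i in range(10)]

train_rows, val_rows = split_pairs(pairs)
again_train, again_val = split_pairs(pairs)  # same seed, same split
```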


**Importing transformers from Hugging Face**

In [9]:
from transformers import MarianTokenizer, MarianMTModel

tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-hi-en')
model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-hi-en')


**Tokenizing datasets**

In [10]:
train_encodings = tokenizer(
    text=list(train_texts[:, 0]),
    text_target=list(train_texts[:, 1]),
    truncation=True,
    padding=True,
)


In [11]:
val_encodings = tokenizer(
    text=list(val_texts[:, 0]),
    text_target=list(val_texts[:, 1]),
    truncation=True,
    padding=True,
)
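
With `padding=True`, the target sequences under the `labels` key are padded with the tokenizer's pad token id as well. A common practice when fine-tuning seq2seq models is to replace those padding positions with `-100`, the index that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the loss. The sketch below shows the idea on plain lists; `mask_padding` and the pad id `0` are hypothetical, and in the notebook the real value would come from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def mask_padding(label_rows, pad_token_id):
    """Replace every padding id in each label row with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_rows
    ]


PAD_ID = 0  # hypothetical pad id, for illustration only
labels = [
    [5, 6, 7, PAD_ID, PAD_ID],  # short target, padded to the batch length
    [8, 9, 10, 11, 12],         # longest target, no padding
]
masked = mask_padding(labels, PAD_ID)
```

Applied to the notebook, this would be something like `train_encodings["labels"] = mask_padding(train_encodings["labels"], tokenizer.pad_token_id)`, again as an assumption about how the author might want to handle padding.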


**Training the model for translation from Hindi to English**

In [12]:
import torch

class TranslationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

train_dataset = TranslationDataset(train_encodings)
val_dataset = TranslationDataset(val_encodings)

In [13]:
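from torch.optim import AdamW  # transformers.AdamW is deprecated
from torch.utils.data import DataLoader

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

optimizer = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    model.train()

    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        # The original cell is cut off here. The lines below are an assumed,
        # standard fine-tuning step, not the author's recovered code.
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()
        optimizer.step()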




**Testing**

In [15]:
inputs = [
    'नमस्ते, आप कैसे हैं?',
    'मैं एक विश्वविद्यालय में पढ़ता हूं',
    'तुमसे मिलकर अच्छा लगा!'
]

In [17]:
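# Move the tokenized batch to the same device as the model; otherwise
# generate() fails when the model is on the GPU.
model.eval()
batch = tokenizer(inputs, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    translated = model.generate(**batch)
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]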



['Hello, how are you?', 'I read in a university', 'Nice to meet you!']