# Language Translation Model 

This notebook attempts to translate Hindi text into English.<br/>
The dataset used in this project is the IIT Bombay English-Hindi Corpus, which is freely available online.

## Step 1: Importing the Dataset

We start with the imports.

In [2]:
import pandas as pd 
import numpy as np 
import tensorflow as tf
import re 

We now load the required dataset.

In [3]:
from datasets import load_dataset
corpus_data = load_dataset('cfilt/iitb-english-hindi')
print(corpus_data)

Using custom data configuration cfilt--iitb-english-hindi-4e9610d2608dd062
Reusing dataset parquet (C:\Users\parth\.cache\huggingface\datasets\parquet\cfilt--iitb-english-hindi-4e9610d2608dd062\0.0.0\0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)
100%|██████████| 3/3 [00:00<00:00, 13.14it/s]

DatasetDict({
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
})





The corpus already provides train, test and validation splits.<br/>
We can now store them individually for further processing.

In [4]:
train_data = corpus_data["train"]["translation"]
test_data = corpus_data["test"]["translation"]
validation_data = corpus_data["validation"]["translation"]

In [5]:
train_data_en = [train_datum['en'] for train_datum in train_data]
train_data_hi = [train_datum['hi'] for train_datum in train_data]

test_data_en = [test_datum['en'] for test_datum in test_data]
test_data_hi = [test_datum['hi'] for test_datum in test_data]

validation_data_en = [validation_datum['en'] for validation_datum in validation_data]
validation_data_hi = [validation_datum['hi'] for validation_datum in validation_data]
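
Each record in the `translation` column is a dictionary keyed by language code, which is why the comprehensions above index with `'en'` and `'hi'`. A minimal sketch of the same split on a small hand-written sample (the `sample_pairs` list and the `split_pairs` helper are hypothetical and not part of the corpus or the `datasets` API):

```python
def split_pairs(pairs):
    """Split a list of {'en': ..., 'hi': ...} records into two parallel lists."""
    english = [pair['en'] for pair in pairs]
    hindi = [pair['hi'] for pair in pairs]
    return english, hindi

# Hypothetical records in the same shape as corpus_data["train"]["translation"]
sample_pairs = [
    {'en': 'Hello', 'hi': 'नमस्ते'},
    {'en': 'Thank you', 'hi': 'धन्यवाद'},
]

sample_en, sample_hi = split_pairs(sample_pairs)
print(sample_en)
print(sample_hi)
```

Keeping the two lists parallel matters: index `i` in one list must always be the translation of index `i` in the other.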

## Step 2: Text preprocessing on the input data

We first preprocess both the English and the Hindi sentences.<br/>
This includes removing URLs, email addresses, numbers and other unwanted characters from each sentence.

In [6]:
def purge_unwanted_characters(data):
    """To remove the unwanted characters from the input data"""

    #Removing URLs with a regular expression
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    data = url_pattern.sub(r'', data)
    
    # Remove Emails
    data = re.sub(r'\S*@\S*\s?', '', data)
    
    # Collapse newlines and repeated whitespace into a single space
    data = re.sub(r'\s+', ' ', data)

    # Remove distracting single quotes
    data = re.sub(r"\'", "", data)

    # Remove numbers from text 
    data = re.sub(r'\d', '', data)

    # Remove underscores and other special characters from text 
    data = re.sub(r'[_#$%]', '', data)
        
    return data

string = "www.example.com 'is' the number 1_ website on this planet"
purge_unwanted_characters(string)

' is the number  website on this planet'
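
The sample output keeps a leading space, since the text is never stripped, and a double space, since whitespace is collapsed before the digits and underscore are removed. One possible variation, sketched here as an assumption rather than as the notebook's method, is to collapse whitespace again as a final step and trim the ends (the `tidy_whitespace` helper is hypothetical):

```python
import re

def tidy_whitespace(data):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r'\s+', ' ', data).strip()

# Input is the output string produced by purge_unwanted_characters above
cleaned = tidy_whitespace(' is the number  website on this planet')
print(repr(cleaned))
```

Whether this matters depends on the tokenizer: a plain `str.split()` already ignores repeated whitespace.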

We first preprocess the English sentences.

In [7]:
def preprocess_english(data):
    processed_data = []
    for sentence in data: 
        processed_data.append(purge_unwanted_characters(sentence).lower())
    return processed_data

train_data_en = preprocess_english(train_data_en)
test_data_en = preprocess_english(test_data_en)
validation_data_en = preprocess_english(validation_data_en)

Next, we preprocess the Hindi sentences. Besides removing the unwanted characters mentioned earlier, we also remove English characters, as they won't help the model during training.

In [8]:
def preprocess_hindi(data):
    processed_data = []
    for sentence in data:
        processed_sentence = purge_unwanted_characters(sentence)
        # str.replace returns a new string, so the result must be assigned back
        processed_sentence = processed_sentence.replace("।", '') # remove the Hindi full stop (purna viram)
        processed_data.append(re.sub(r'[a-zA-Z]', '', processed_sentence))
    return processed_data

train_data_hi = preprocess_hindi(train_data_hi)
test_data_hi = preprocess_hindi(test_data_hi)
validation_data_hi = preprocess_hindi(validation_data_hi)
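
Because Python strings are immutable, `str.replace` and `re.sub` return a new string rather than modifying their argument, so each cleaning step has to be assigned back to a variable. A minimal sketch on a made-up sentence (the `strip_for_hindi` helper and the sample text are hypothetical):

```python
import re

def strip_for_hindi(sentence):
    """Remove the purna viram and Latin letters from one sentence."""
    sentence = sentence.replace("।", "")          # keep the returned string
    sentence = re.sub(r'[a-zA-Z]', '', sentence)  # drop English letters
    return sentence

sample = "यह abc एक वाक्य है।"
result = strip_for_hindi(sample)
print(result)
```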

We now move on to the word embeddings, for which we use pretrained FastText vectors.<br/>
FastText offers embeddings for both English and Hindi.<br/>
Pretrained vectors should carry more meaning than a default embedding layer learned from scratch.

In [9]:
from gensim.models import KeyedVectors 

english_embeddings = KeyedVectors.load_word2vec_format('cc.en.300.vec')
hindi_embeddings = KeyedVectors.load_word2vec_format('cc.hi.300.vec')
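
The `.vec` files are plain text in the word2vec format: a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its components. A minimal sketch of that layout, parsed by hand from a tiny made-up in-memory file (the `parse_vec` helper and the three-dimensional toy vectors are hypothetical; the real files use 300 dimensions and are loaded with gensim as above):

```python
import io

def parse_vec(handle):
    """Parse word2vec text format into a {word: list of floats} dictionary."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    assert len(vectors) == vocab_size  # header must match the body
    return vectors, dim

# Toy file: 2 words, 3 dimensions each
toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
toy_vectors, toy_dim = parse_vec(toy_file)
print(toy_dim, toy_vectors['hello'])
```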

In [10]:
def embeddings_for_english(text):
    """Return per-sentence word vectors and the number of OOV tokens."""
    text_embedding = []
    oov_count = 0
    for sentence in text:
        sentence_embedding = []
        for word in sentence.split():
            try:
                # get_vector replaces the deprecated word_vec
                sentence_embedding.append(english_embeddings.get_vector(word))
            except KeyError:
                sentence_embedding.append(np.zeros(300,)) # OOV words
                oov_count += 1
        text_embedding.append(sentence_embedding)
    return text_embedding, oov_count

def embeddings_for_hindi(text):
    """Return per-sentence word vectors and the number of OOV tokens."""
    text_embedding = []
    oov_count = 0  # must be initialised before it is incremented below
    for sentence in text:
        sentence_embedding = []
        for word in sentence.split():
            try:
                sentence_embedding.append(hindi_embeddings.get_vector(word))
            except KeyError:
                sentence_embedding.append(np.zeros(300,)) # OOV words
                oov_count += 1
        text_embedding.append(sentence_embedding)
    return text_embedding, oov_count
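
Both functions follow the same pattern: look each token up, fall back to a zero vector when the token is out of vocabulary, and count the misses. A minimal sketch of that pattern as a single function that takes the lookup table as an argument (the `embed_sentences` helper and the toy two-dimensional table are hypothetical, and plain lists stand in for NumPy arrays to keep the sketch dependency-free):

```python
def embed_sentences(sentences, vectors, dim):
    """Return per-sentence lists of vectors plus the number of OOV tokens."""
    text_embedding = []
    oov_count = 0
    for sentence in sentences:
        sentence_embedding = []
        for word in sentence.split():
            try:
                sentence_embedding.append(vectors[word])
            except KeyError:
                sentence_embedding.append([0.0] * dim)  # OOV words
                oov_count += 1
        text_embedding.append(sentence_embedding)
    return text_embedding, oov_count

# Toy lookup table; 'zzz' is deliberately missing
toy_table = {'good': [0.5, 0.1], 'morning': [0.2, 0.9]}
toy_embeddings, toy_oov = embed_sentences(['good morning', 'good zzz'], toy_table, dim=2)
print(toy_oov)
```

Passing the table in as an argument would let one function serve both languages, assuming the table raises `KeyError` for missing words.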

In [14]:
en_train_data_embeddings, en_train_oov_count = embeddings_for_english(train_data_en)
en_test_data_embeddings, en_test_oov_count = embeddings_for_english(test_data_en)
en_validation_data_embeddings, en_validation_oov_count = embeddings_for_english(validation_data_en)

hi_train_data_embeddings, hi_train_oov_count = embeddings_for_hindi(train_data_hi)
hi_test_data_embeddings, hi_test_oov_count = embeddings_for_hindi(test_data_hi)
hi_validation_data_embeddings, hi_validation_oov_count = embeddings_for_hindi(validation_data_hi)



In [13]:
count = 0
for sent_embed in en_train_data_embeddings:
    for word_embed in sent_embed: 
        count += 1

print("English train data OOV token proportion: ", en_train_oov_count/count)

count = 0
for sent_embed in en_test_data_embeddings:
    for word_embed in sent_embed: 
        count += 1

print("English test data OOV token proportion: ", en_test_oov_count/count)

count = 0
for sent_embed in en_validation_data_embeddings:
    for word_embed in sent_embed: 
        count += 1

print("English validation data OOV token proportion: ", en_validation_oov_count/count)




# Count tokens in the Hindi embeddings (not the English ones) for each split
count = 0
for sent_embed in hi_train_data_embeddings:
    for word_embed in sent_embed: 
        count += 1

print("Hindi train data OOV token proportion: ", hi_train_oov_count/count)

count = 0
for sent_embed in hi_test_data_embeddings:
    for word_embed in sent_embed: 
        count += 1

print("Hindi test data OOV token proportion: ", hi_test_oov_count/count)

count = 0
for sent_embed in hi_validation_data_embeddings:
    for word_embed in sent_embed: 
        count += 1

print("Hindi validation data OOV token proportion: ", hi_validation_oov_count/count)

NameError: name 'en_train_data_embeddings' is not defined
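
The OOV proportion is the number of zero-vector fallbacks divided by the total number of tokens in a split. A minimal sketch of that calculation as a reusable helper, run on made-up values (the `oov_proportion` name and the toy inputs are hypothetical):

```python
def oov_proportion(embeddings, oov_count):
    """Fraction of tokens that fell back to the zero vector."""
    token_count = sum(len(sentence) for sentence in embeddings)
    if token_count == 0:
        return 0.0  # avoid dividing by zero on an empty split
    return oov_count / token_count

# Two toy sentences with two tokens each; one token is a zero-vector fallback
toy_embeddings = [[[0.5, 0.1], [0.0, 0.0]], [[0.2, 0.9], [0.5, 0.1]]]
toy_ratio = oov_proportion(toy_embeddings, oov_count=1)
print("Toy OOV token proportion: ", toy_ratio)
```

Each split needs its own embedding list and its own OOV count, and both must exist before the proportion is computed.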

## Step 3: Model

With the data preprocessing done, we can now build the transformer that will perform the language translation task.