# Language Translation Model 

This notebook attempts to translate text from Hindi to English.<br/>
The dataset used in this project is the IIT Bombay Hindi English Corpus, which is freely available online.

## Step 1: Importing the Dataset

We start with the imports.

In [9]:
import pandas as pd 
import numpy as np 
import gensim 
import tensorflow as tf
import re 
import spacy 

We now load the required dataset.

In [5]:
from datasets import load_dataset
corpus_data = load_dataset('cfilt/iitb-english-hindi')
print(corpus_data)

DatasetDict({
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})





The corpus ships with predefined train, test and validation splits.<br/>
We store each split separately for further processing.

In [6]:
train_data = corpus_data["train"]["translation"]
test_data = corpus_data["test"]["translation"]
validation_data = corpus_data["validation"]["translation"]

In [25]:
train_data_en = [train_datum['en'] for train_datum in train_data]
train_data_hi = [train_datum['hi'] for train_datum in train_data]

test_data_en = [test_datum['en'] for test_datum in test_data]
test_data_hi = [test_datum['hi'] for test_datum in test_data]

validation_data_en = [validation_datum['en'] for validation_datum in validation_data]
validation_data_hi = [validation_datum['hi'] for validation_datum in validation_data]
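The three pairs of list comprehensions above follow the same pattern: each record is a dictionary with an `en` and a `hi` key. As a minimal sketch, that pattern could be expressed once in a helper. The name `split_pairs` and the toy records below are hypothetical and stand in for the real splits.

```python
def split_pairs(pairs):
    """Split a list of {'en': ..., 'hi': ...} records into two parallel lists."""
    en = [pair["en"] for pair in pairs]
    hi = [pair["hi"] for pair in pairs]
    return en, hi


# Toy records shaped like the corpus entries (illustrative values, not corpus data)
toy_pairs = [
    {"en": "Hello", "hi": "नमस्ते"},
    {"en": "Thank you", "hi": "धन्यवाद"},
]

toy_en, toy_hi = split_pairs(toy_pairs)
print(toy_en)
print(toy_hi)
```

With such a helper, each split would need a single call, for example `train_data_en, train_data_hi = split_pairs(train_data)`.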

## Step 2: Text preprocessing on the input data

First, we define a few reusable preprocessing helpers.<br/>
In brief, we remove unwanted characters and unnecessary words, stem the remaining words, and then rebuild the sentences from the processed tokens.

In [7]:
from nltk.tokenize import TreebankWordDetokenizer
from nltk.stem import PorterStemmer 

def purge_unwanted_characters(data):
    """Remove unwanted characters from the input string."""

    # Remove URLs with a regular expression
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    data = url_pattern.sub('', data)

    # Remove email addresses
    data = re.sub(r'\S*@\S*\s?', '', data)

    # Collapse runs of whitespace (including newlines) into a single space
    data = re.sub(r'\s+', ' ', data)

    # Remove distracting single quotes
    data = re.sub(r"'", "", data)

    return data

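def remove_long_short_tokens(sentences):
    """Tokenize each sentence, dropping tokens shorter than 2 or longer than 15 characters.

    These limits are the default min_len and max_len of gensim's simple_preprocess.
    """
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)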
    

def detokenize(text):
    """To merge the tokens into a sentence. To be done when context of sentence is needed"""
    return TreebankWordDetokenizer().detokenize(text)


stemmer = PorterStemmer()

def stem(words):
    """Stem each token in the given list of tokens."""
    return [stemmer.stem(word) for word in words]
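To make the effect of the character-cleaning step concrete, here is a small self-contained sketch that applies the same four regular expressions as `purge_unwanted_characters` to a single string. The function name `clean_text` and the sample sentence are made up for illustration. The block uses only the standard library.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def clean_text(data):
    """Apply the same four substitutions as purge_unwanted_characters."""
    data = URL_PATTERN.sub('', data)          # drop URLs
    data = re.sub(r'\S*@\S*\s?', '', data)    # drop email addresses
    data = re.sub(r'\s+', ' ', data)          # collapse whitespace and newlines
    data = re.sub(r"'", '', data)             # drop single quotes
    return data


# Hypothetical input containing a URL, an email address, a newline and a quote
sample = "Visit https://example.com or mail me@example.com\nfor the user's guide"
cleaned = clean_text(sample)
print(cleaned)  # Visit or mail for the users guide
```

Note that the order of the substitutions matters: removing the URL leaves a double space behind, which the later whitespace step collapses.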

Next comes the word embedding step, for which we use GloVe embeddings.<br/>
We use the pretrained GloVe file with 200-dimensional vectors to look up the embeddings of the English words in the input corpus.

In [28]:
from scipy import spatial

embeddings = {}
with open('glove.6B.200d.txt', 'r', encoding='utf8') as f:
    # Each line holds a word followed by the components of its vector
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], 'float32')
        embeddings[word] = vector

def find_similar_words(word_embedding):
    """Return all vocabulary words sorted by Euclidean distance to the given embedding."""
    nearest = sorted(
        embeddings.keys(),
        key=lambda word: spatial.distance.euclidean(embeddings[word], word_embedding),
    )
    return nearest
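`find_similar_words` sorts the entire vocabulary on every call, even when only a handful of neighbours is needed. As one possible alternative, sketched here under assumptions and not the notebook's own method, `heapq.nsmallest` can return only the `k` closest words. The helper name `find_k_nearest` and the two-dimensional toy vectors are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import heapq
import math

# Toy 2-dimensional vectors standing in for the 200-dimensional GloVe vectors
toy_embeddings = {
    "give": [1.0, 0.0],
    "take": [0.9, 0.1],
    "gym": [0.0, 1.0],
    "workout": [0.1, 0.9],
}


def find_k_nearest(word_embedding, embeddings, k=5):
    """Return the k words whose vectors are closest (Euclidean) to word_embedding."""
    return heapq.nsmallest(
        k,
        embeddings,
        key=lambda word: math.dist(embeddings[word], word_embedding),
    )


nearest = find_k_nearest(toy_embeddings["give"], toy_embeddings, k=2)
print(nearest)  # ['give', 'take']
```

The query word is always its own nearest neighbour, at distance zero, so it appears first in the result.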

With the GloVe embeddings dictionary defined, let us find the 5 most similar words for each token of an English sentence in the corpus.

In [33]:
print("Sentence is : ", train_data_en[0])
print("Words similar to these tokens are: ")
for word in train_data_en[0].split():
    print(find_similar_words(embeddings[word.lower()])[:5])

Sentence is :  Give your application an accessibility workout
Words similar to these tokens are: 
['give', 'take', 'giving', 'make', 'put']
['your', 'my', 'you', 'our', "'ll"]
['application', 'applications', 'allows', 'apply', 'applied']
['an', 'another', 'this', 'as', 'that']
['accessibility', 'responsiveness', 'availability', 'usability', 'importantly']
['workout', 'workouts', 'gym', 'regimen', 'offseason']
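The lookup above works because every token of this sentence happens to be in the GloVe vocabulary. A token with attached punctuation, or a word missing from the vocabulary, would raise a `KeyError`. The following sketch shows one way to guard against that, assuming a simple lowercase alphabetic tokenizer. The helper name `lookup_tokens`, the toy vectors and the sample sentence are hypothetical.

```python
import re

# Toy vocabulary standing in for the GloVe dictionary
toy_embeddings = {"give": [1.0, 0.0], "your": [0.5, 0.5]}


def lookup_tokens(sentence, embeddings):
    """Split a sentence into lowercase alphabetic tokens and separate
    the tokens that have an embedding from those that do not."""
    found, missing = {}, []
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in embeddings:
            found[token] = embeddings[token]
        else:
            missing.append(token)
    return found, missing


found, missing = lookup_tokens("Give your application, please!", toy_embeddings)
print(list(found))  # ['give', 'your']
print(missing)      # ['application', 'please']
```

Collecting the missing tokens separately makes it easy to see how much of a sentence the pretrained vocabulary actually covers.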
