# Language Translation Model 

This notebook attempts to perform an effective language translation from Hindi to English language<br/>
The dataset being used in this project is the IIT Bombay Hindi English Corpus which has been provided for free use on the internet 

## Step 1: Importing the Dataset

First we would start with the imports 

In [334]:
import pandas as pd 
import numpy as np 
import tensorflow as tf
import re 

We now go ahead and import the required datatset

In [335]:
from datasets import load_dataset
corpus_data = load_dataset('cfilt/iitb-english-hindi')
print(corpus_data)

Using custom data configuration cfilt--iitb-english-hindi-911387c6837f8b91
Reusing dataset parquet (C:\Users\parth\.cache\huggingface\datasets\parquet\cfilt--iitb-english-hindi-911387c6837f8b91\0.0.0\0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)
100%|██████████| 3/3 [00:00<00:00, 10.51it/s]

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
})





We see that the corpus has the train, test and validation sets prepared by default<br/>
We can now go ahead and store them individually for further processing 

In [336]:
train_data = corpus_data["train"]["translation"]
test_data = corpus_data["test"]["translation"]
validation_data = corpus_data["validation"]["translation"]

In [337]:
train_data_en = [train_datum['en'] for train_datum in train_data]
train_data_hi = [train_datum['hi'] for train_datum in train_data]

test_data_en = [test_datum['en'] for test_datum in test_data]
test_data_hi = [test_datum['hi'] for test_datum in test_data]

validation_data_en = [validation_datum['en'] for validation_datum in validation_data]
validation_data_hi = [validation_datum['hi'] for validation_datum in validation_data]

## Step 2: Text preprocessing on the input data

We first try to perform preprocessing on both English and Hindi sentences.<br/>
This includes
<ul>
    <li>Removing sentences whose length is more than a defined value</li>
    <li>Removing unwanted characters from the remaining sentences</li>
    <li>Converting the sentences to a sequence of numbers (embeddings) so that they can be fed to the model</li>
    <li>Padding the sequences so that they are all of uniform length</li>
</ul>

In [338]:
maxlen = 30

en_sent_list = []
for en_sent in train_data_en:
    if len(en_sent) <= maxlen:
        en_sent_list.append(en_sent)
train_data_en = en_sent_list
en_sent_list = []
for en_sent in test_data_en:
    if len(en_sent) <= maxlen:
        en_sent_list.append(en_sent)
test_data_en = en_sent_list
en_sent_list = []
for en_sent in validation_data_en:
    if len(en_sent) <= maxlen:
        en_sent_list.append(en_sent)
validation_data_en = en_sent_list


hi_sent_list = []
for hi_sent in train_data_hi:
    if len(hi_sent) <= maxlen:
        hi_sent_list.append(hi_sent)
train_data_hi = hi_sent_list
hi_sent_list = []
for hi_sent in test_data_hi:
    if len(hi_sent) <= maxlen:
        hi_sent_list.append(hi_sent)
test_data_hi = hi_sent_list
hi_sent_list = []
for hi_sent in validation_data_hi:
    if len(hi_sent) <= maxlen:
        hi_sent_list.append(hi_sent)
validation_data_hi = hi_sent_list

In [339]:
def purge_unwanted_characters(data):
    """To remove the unwanted characters from the input data"""

    #Removing URLs with a regular expression
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    data = url_pattern.sub(r'', data)
    
    # Remove Emails
    data = re.sub(r'\S*@\S*\s?', '', data)
    
    # Remove new line characters
    data = re.sub(r'\s+', ' ', data)

    # Remove distracting single quotes
    data = re.sub(r"\'", "", data)

    # Remove numbers from text 
    data = re.sub(r'\d', '', data)

    # Remove underscores and other special characters from text 
    data = re.sub(r'[_#$%]', '', data)
        
    return data

string = "www.example.com 'is' the number 1_ website on this planet"
purge_unwanted_characters(string)

' is the number  website on this planet'

We will first proceed to perform preprocessing on English sentences

In [340]:
def preprocess_english(data):
    processed_data = []
    for sentence in data: 
        processed_data.append(purge_unwanted_characters(sentence).lower())
    return processed_data

train_data_en = preprocess_english(train_data_en)
test_data_en = preprocess_english(test_data_en)
validation_data_en = preprocess_english(validation_data_en)

Now, we will proceed onto preprocessing of Hindi sentences. Note that here we will be trying to not only remove the unwanted characters as mentioned earlier, we will also try to remove english characters as they won't help the model in training. 

In [341]:
def preprocess_hindi(data):
    processed_data = []
    for sentence in data:
        processed_sentence = purge_unwanted_characters(sentence)
        processed_sentence.replace("।", '') # remove the hindi full stop (purn viram)
        processed_data.append(re.sub(r'[a-zA-Z]', '', processed_sentence))
    return processed_data

train_data_hi = preprocess_hindi(train_data_hi)
test_data_hi = preprocess_hindi(test_data_hi)
validation_data_hi = preprocess_hindi(validation_data_hi)

Now we will move onto the word embedding part 

In [342]:
#Tokenize the texts and convert to sequences
en_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>', lower=False)
en_tokenizer.fit_on_texts(train_data_en)
en_train_sequences = en_tokenizer.texts_to_sequences(train_data_en)

hi_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>', lower=False)
hi_tokenizer.fit_on_texts(train_data_hi)
hi_train_sequences = hi_tokenizer.texts_to_sequences(train_data_hi)

We now do the same for test and validation datasets.<br/>
After creating the embeddings for all data, we are going to fetch the final size of the vocabulary which can be fetched from the tokenizer object. 

In [343]:
en_tokenizer.fit_on_texts(test_data_en)
en_test_sequences = en_tokenizer.texts_to_sequences(test_data_en)
en_tokenizer.fit_on_texts(validation_data_en)
en_validation_sequences = en_tokenizer.texts_to_sequences(validation_data_en)

hi_tokenizer.fit_on_texts(test_data_hi)
hi_test_sequences = hi_tokenizer.texts_to_sequences(test_data_hi)
hi_tokenizer.fit_on_texts(validation_data_hi)
hi_validation_sequences = hi_tokenizer.texts_to_sequences(validation_data_hi)


english_vocab_size = len(en_tokenizer.word_index) + 1
hindi_vocab_size = len(en_tokenizer.word_index) + 1

print("Vocab size of english data: ", english_vocab_size)
print("Vocab size of hindi data: ", hindi_vocab_size)

Vocab size of english data:  107958
Vocab size of hindi data:  107958


A simple look at one of the embeddings will help us understand how the embedding was done by keras 

In [344]:
print(en_train_sequences[0])
print(train_data_en[0])

[670, 1303]
highlight duration


Now we pad all the sequences so that they are all of uniform length<br/>
Note that we are using English sentences as inputs to the encoder part of our transformer model and the hindi ones as both the outputs and inputs for the decoder of the transformer<br/>
The latter is because we need the translated output from the decoder and we need to train it with the target language sequences 

In [345]:
#Pad english sentences for using them as encoder inputs  
en_train_sequences = tf.keras.preprocessing.sequence.pad_sequences(en_train_sequences, maxlen=maxlen, padding='post')
en_test_sequences = tf.keras.preprocessing.sequence.pad_sequences(en_test_sequences, maxlen=maxlen, padding='post')
en_validation_sequences = tf.keras.preprocessing.sequence.pad_sequences(en_validation_sequences, maxlen=maxlen, padding='post')

#Pad hindi sentences for using them as decoder outputs and inputs 
decoder_inputs_train = []
decoder_outputs_train = []

# remove the last token in input and first token in the output. This is because, decoder should be trained such that for every input token
# the next token in sequence should be presented as the output. The training will be done accordingly.
for hi in hi_train_sequences:
  decoder_inputs_train.append(hi[:-1]) 
  decoder_outputs_train.append(hi[1:])

decoder_inputs_train = tf.keras.preprocessing.sequence.pad_sequences(decoder_inputs_train, maxlen=maxlen, padding='post')
decoder_outputs_train = tf.keras.preprocessing.sequence.pad_sequences(decoder_outputs_train, maxlen=maxlen, padding='post')


decoder_inputs_test = []
decoder_outputs_test = []

for hi in hi_test_sequences:
  decoder_inputs_test.append(hi[:-1])
  decoder_outputs_test.append(hi[1:])

decoder_inputs_test = tf.keras.preprocessing.sequence.pad_sequences(decoder_inputs_test, maxlen=maxlen, padding='post')
decoder_outputs_test = tf.keras.preprocessing.sequence.pad_sequences(decoder_outputs_test, maxlen=maxlen, padding='post')

decoder_inputs_validation = []
decoder_outputs_validation = []

for hi in hi_validation_sequences:
  decoder_inputs_validation.append(hi[:-1])
  decoder_outputs_validation.append(hi[1:])

decoder_inputs_validation = tf.keras.preprocessing.sequence.pad_sequences(decoder_inputs_validation, maxlen=maxlen, padding='post')
decoder_outputs_validation = tf.keras.preprocessing.sequence.pad_sequences(decoder_outputs_validation, maxlen=maxlen, padding='post')

In [346]:
print("English sentence: ", train_data_en[0])
print("Encoding: ", en_train_sequences[0])

English sentence:  highlight duration
Encoding:  [ 670 1303    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]


## 3. Model 

With the data preprocessing done, we should now proceed towards building the transformer which will perform the language translation task<br/>

First let us prepare the embedding layer 

In [347]:
from torch import nn

class Embedder(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
    def forward(self, x):
        embedding = self.embed(x)
        return embedding


Next we add the positional encoder 

In [348]:
import math
import torch 
from torch.autograd import Variable

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len = 80):
        super().__init__()
        self.d_model = d_model
        
        # create constant 'pe' matrix with values dependant on 
        # pos and i
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * (i + 1))/d_model)))
                
        self.register_buffer('pe', pe)
 
    
    def forward(self, x):
        # make embeddings relatively larger
        x = x * math.sqrt(self.d_model)

        #add constant to embedding
        seq_len = x.size(0)
        
        x = x + Variable(self.pe[:seq_len], requires_grad=False)

        return x

Multi head attention 

In [349]:
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout = 0.1):
        super().__init__()
        
        self.d_model = d_model
        self.d_k = d_model // heads
        self.h = heads
        
        self.q_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)
    
    def forward(self, q, k, v, mask=None):
        
        batch_size = q.size(0)
        
        # perform linear operation and split into h heads
        
        k = self.k_linear(k).view(batch_size, -1, self.h, self.d_k)
        q = self.q_linear(q).view(batch_size, -1, self.h, self.d_k)
        v = self.v_linear(v).view(batch_size, -1, self.h, self.d_k)
        
        # transpose to get dimensions bs * h * sl * d_model

        k = k.transpose(1,2)
        q = q.transpose(1,2)
        v = v.transpose(1,2)


        # calculate attention using function we will define next
        scores = self.attention(q, k, v, self.d_k, mask, self.dropout)
        print("scores shape: ", scores.shape)
        
        # concatenate heads and put through final linear layer
        concat = scores.transpose(1,2).contiguous().view(batch_size, -1, self.d_model)
        print("concat shape: ", concat.shape)
        
        output = self.out(concat)
        print("output shape: ", output.shape)
    
        return output
    
    def attention(self, q, k, v, d_k, mask=None, dropout=None):
        
        scores = torch.matmul(q, k.transpose(-2, -1)) /  math.sqrt(d_k)

        # if mask is not None:
        #     mask = mask.unsqueeze(1)
        #     scores = scores.masked_fill(mask == 0, -1e9)
        scores = F.softmax(scores, dim=-1)

        if dropout is not None:
            scores = dropout(scores)

        output = torch.matmul(scores, v)

        return output

Feed forward neural network

In [350]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout = 0.1):
        super().__init__() 
        # We set d_ff as a default to 2048
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)
    def forward(self, x):
        x = self.dropout(F.relu(self.linear_1(x)))
        x = self.linear_2(x)
        return x

Various masks are created here 

<ul>
<li>Padding mask for English</li>
<li>Padding mask for Hindi</li>
<li>No peek mask for masking future tokens in order to avoid decoder from seeing the inputs early</li>
</ul>

In [361]:
size = maxlen

def no_peek_mask():
    nopeek_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
    nopeek_mask = torch.from_numpy(nopeek_mask) == 0

    return nopeek_mask.cuda()

def padding_mask(text):
    pad_mask = text != 0
    return pad_mask.cuda()

def create_masks(src_text, target_text):
    return padding_mask(src_text), no_peek_mask() & padding_mask(target_text)


no_peek_mask().unsqueeze(1).shape

torch.Size([1, 1, 30, 30])

Normalization layer 

In [352]:
class Norm(nn.Module):
    def __init__(self, d_model, eps = 1e-6):
        super().__init__()
    
        self.size = d_model
        # create two learnable parameters to calibrate normalisation
        self.alpha = nn.Parameter(torch.ones(self.size))
        self.bias = nn.Parameter(torch.zeros(self.size))
        self.eps = eps
    def forward(self, x):
        norm = self.alpha * (x - x.mean(dim=-1, keepdim=True))/(x.std(dim=-1, keepdim=True) + self.eps) + self.bias
        return norm

In [353]:
import copy

# build an encoder layer with one multi-head attention layer and one feed-forward layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout = 0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.attn = MultiHeadAttention(heads, d_model)
        self.ff = FeedForward(d_model)
        self.dropout_1 = nn.Dropout(dropout, inplace=True)
        self.dropout_2 = nn.Dropout(dropout, inplace=True)
        
    def forward(self, x, mask):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2,x2,x2,mask).squeeze(1))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))
        return x
    
# build a decoder layer with two multi-head attention layers and one feed-forward layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.norm_3 = Norm(d_model)
        
        self.dropout_1 = nn.Dropout(dropout, inplace=True)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        
        self.attn_1 = MultiHeadAttention(heads, d_model)
        self.attn_2 = MultiHeadAttention(heads, d_model)
        self.ff = FeedForward(d_model)

    def forward(self, x, e_outputs, src_mask, trg_mask):
            x2 = self.norm_1(x)
            x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask).squeeze(1))
            x2 = self.norm_2(x)
            x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs, src_mask).squeeze(1))
            x2 = self.norm_3(x)
            x = x + self.dropout_3(self.ff(x2))
            return x
# We can then build a convenient cloning function that can generate multiple layers:
def get_clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])

In [354]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model)
        self.layers = get_clones(EncoderLayer(d_model, heads), N)
        self.norm = Norm(d_model)
    def forward(self, src, mask):
        x = self.embed(src)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, mask)
        return self.norm(x)
    
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model)
        self.layers = get_clones(DecoderLayer(d_model, heads), N)
        self.norm = Norm(d_model)
        self.softmax_layer = tf.keras.layers.Softmax()
    def forward(self, trg, e_outputs, src_mask, trg_mask):
        x = self.embed(trg)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, e_outputs, src_mask, trg_mask)
        x =  self.norm(x)
        return x

In [355]:
class Transformer(nn.Module):
    def __init__(self, src_vocab, trg_vocab, d_model, N, heads):
        super().__init__()
        self.encoder = Encoder(src_vocab, d_model, N, heads)
        self.decoder = Decoder(trg_vocab, d_model, N, heads)
        self.out = nn.Linear(d_model, trg_vocab)
        self.softmax = tf.keras.layers.Softmax()
    def forward(self, src, trg, src_mask, trg_mask):
        e_outputs = self.encoder.forward(src, src_mask)
        d_output = self.decoder.forward(trg, e_outputs, src_mask, trg_mask)
        print(d_output.shape)
        output = self.out(d_output)
        return output

In [356]:
d_model = 512
heads = 8
N = 2

#Save model after each epoch
checkpoint = tf.keras.callbacks.ModelCheckpoint("best_model1.hdf5", monitor='val_accuracy', verbose=1, save_best_only=True, mode='auto', save_weights_only=False)

model = Transformer(english_vocab_size, hindi_vocab_size, d_model, N, heads).cuda()

for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
    
# this code is very important! It initialises the parameters with a
# range of values that stops the signal fading or getting too big.
# See this blog for a mathematical explanation.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

In [362]:
import time 

def train_model(epochs, print_every=100):
    
    model.train()
    
    start = time.time()
    temp = start
    
    total_loss = 0
    
    for epoch in range(epochs):
        for en_train_sequence, target_lang_input, target_lang_output in zip(en_train_sequences, decoder_inputs_train, decoder_outputs_train):
            source = torch.from_numpy(en_train_sequence).cuda()
            target_lang_input = torch.from_numpy(target_lang_input).cuda()
            target_lang_output = torch.from_numpy(target_lang_output).cuda()
            
            # create function to make masks using mask code above
            src_mask, trg_mask = create_masks(source, target_lang_input)
            
            preds = model(source, target_lang_input, src_mask, trg_mask)
            
            optimizer.zero_grad()

            target_lang_output = F.one_hot(target_lang_output.to(torch.int64), num_classes=hindi_vocab_size)
            print(target_lang_output.shape)

            loss = F.cross_entropy(preds.view(-1, preds.size(-1)).cuda(), target_lang_output.to(torch.float64).cuda())
            loss.backward()

            print(loss.data)
            
            total_loss += loss.data.item()
            if (epoch + 1) % print_every == 0:
                loss_avg = total_loss / print_every
                print("time = %dm, epoch %d, loss = %.3f, %ds per %d iters" % ((time.time() - start) // 60,
                epoch + 1, loss_avg, time.time() - temp, print_every))
                total_loss = 0
                temp = time.time()

train_model(1)