# Data loading and cleaning

**Note:** To run the notebook, upload the provided "train.csv" and "hindistatements_week2.csv" files for phase 1 to the Colaboratory folder in Google Drive.

In [None]:
import string
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import numpy as np
import pandas as pd

import random
import joblib
import pickle

In [None]:
device = torch.device("cpu")
device

device(type='cpu')

We read the data directly from Drive. For safety, we keep a copy of the data and then drop the first column.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
data=pd.read_csv("/content/drive/MyDrive/Temp/Hindi3/train-set.csv")
data_copy=data.copy()
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

Unnamed: 0,hindi,english
0,"एल सालवाडोर मे, जिन दोनो पक्षों ने सिविल-युद्ध...","In El Salvador, both sides that withdrew from ..."
1,मैं उनके साथ कोई लेना देना नहीं है.,I have nothing to do with them.
2,-हटाओ रिक.,"Fuck them, Rick."
3,क्योंकि यह एक खुशियों भरी फ़िल्म है.,Because it's a happy film.
4,The thought reaching the eyes...,The thought reaching the eyes...


In [None]:
data.shape

(102322, 2)

**Creating columns for sentence length for both Hindi and English**

Sentence length can play an important role in cleaning the data, so we create two new columns holding the word counts of the Hindi and English sentences.

In [None]:
data['len_hindi']=data['hindi'].apply(lambda x:len(x.split(' ')))
data['len_english']=data['english'].apply(lambda x: len(x.split(' ')))
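
A small sketch of how the choice of splitting affects the word count. `split(' ')` treats every single space as a separator, so leading, trailing, or doubled spaces produce empty strings that are counted as words; `split()` with no argument does not. The helper name `word_count` and the sample string are assumptions for illustration only.

```python
def word_count(sentence):
    # split() with no argument splits on runs of whitespace and
    # ignores leading/trailing whitespace
    return len(sentence.split())

sample = " a  b c "                 # leading, doubled and trailing spaces
naive = len(sample.split(' '))      # empty strings are counted too
robust = word_count(sample)

print(naive, robust)
```

If the lengths are later used as a filter, the whitespace-robust count is the safer of the two.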

**Checking for null values**

Checking for null values is generally the first step in data analysis. Here we check whether the dataset contains any.

In [None]:
data.isnull().sum()

hindi          0
english        0
len_hindi      0
len_english    0
dtype: int64

There are no null values present in the data.

**Dropping duplicates from the data**

It may be the case that there are some duplicate entries present in the dataset. We can remove such entries.

In [None]:
print(f"Shape of the dataset before removing the duplicates: {data.shape}")
data.drop_duplicates(inplace=True)
print(f"Shape of the dataset after removing the duplicates: {data.shape}")

Shape of the dataset before removing the duplicates: (102322, 4)
Shape of the dataset after removing the duplicates: (102296, 4)


We can see there were some duplicates present in the dataset.

**Converting all characters to lowercase**

We can perform lowercase normalization on the whole data.

In [None]:
data['hindi']=data['hindi'].apply(lambda x: x.lower())
data['english']=data['english'].apply(lambda x: x.lower())

**Removing punctuation**

Both the Hindi and English sentences contain various punctuation marks. We remove them as part of data cleaning. Note that `string.punctuation` covers ASCII punctuation only, so marks such as the Devanagari danda (।) are left in place.



In [None]:
def remove_punctuations(sentence):
    punctuations=list(string.punctuation)
    cleaned=""
    for letter in sentence:
        if letter not in punctuations:
            cleaned+=letter
    return cleaned

In [None]:
data['hindi']=data['hindi'].apply(lambda x: remove_punctuations(x))
data['english']=data['english'].apply(lambda x: remove_punctuations(x))
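
A possible alternative to the character loop, sketched with `str.translate`. It removes the same ASCII punctuation and, as an assumed extension that is not part of the original cleaning, also the Devanagari danda and double danda. The names `EXTRA_PUNCTUATION` and `remove_punctuations_fast` are hypothetical.

```python
import string

# ASCII punctuation plus the Devanagari danda (U+0964) and double danda (U+0965);
# the extra characters are an assumption, not part of the original cleaning step
EXTRA_PUNCTUATION = "\u0964\u0965"
_DELETE_TABLE = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)

def remove_punctuations_fast(sentence):
    # translate() deletes every character listed in the table in a single pass
    return sentence.translate(_DELETE_TABLE)

print(remove_punctuations_fast("hello, world!"))
print(remove_punctuations_fast("यह ठीक है।"))
```

Building the table once and reusing it avoids rebuilding the punctuation list for every sentence.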

**Removing mixed sentences (samples with English words in the Hindi sentence)**

On inspecting the data, we find some samples in which English words appear within the Hindi sentences. We treat these samples as outliers and remove them from the dataset.

In [None]:
def is_mixed(sentence):
    letters="abcdefghijklmnopqrstuvwxyz"
    for ch in letters:
        if ch in sentence:
            return True
    return False

In [None]:
data['is_mixed']=data['hindi'].apply(lambda x : is_mixed(x))
data.head()

Unnamed: 0,hindi,english,len_hindi,len_english,is_mixed
0,एल सालवाडोर मे जिन दोनो पक्षों ने सिविलयुद्ध स...,in el salvador both sides that withdrew from t...,22,23,False
1,मैं उनके साथ कोई लेना देना नहीं है,i have nothing to do with them,8,7,False
2,हटाओ रिक,fuck them rick,2,3,False
3,क्योंकि यह एक खुशियों भरी फ़िल्म है,because its a happy film,7,5,False
4,the thought reaching the eyes,the thought reaching the eyes,5,5,True


In [None]:
data['is_mixed'].value_counts(normalize=True)*100

False    94.430867
True      5.569133
Name: is_mixed, dtype: float64

In [None]:
data=data[data['is_mixed']==False]
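
The same check can be sketched with a regular expression. Like `is_mixed`, it assumes the text has already been lowercased, so only `a-z` needs to be matched. The name `is_mixed_regex` is hypothetical.

```python
import re

# matches any lowercase Latin letter; the text is assumed to be lowercased already
_LATIN_LETTER = re.compile(r"[a-z]")

def is_mixed_regex(sentence):
    # True as soon as one Latin letter is found anywhere in the sentence
    return _LATIN_LETTER.search(sentence) is not None

print(is_mixed_regex("लगभग एक snuggie के लिए कहता है"))
print(is_mixed_regex("हटाओ रिक"))
```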

**Changing the Encoding of the data**

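The text may contain characters outside the expected encoding. Round-tripping the Hindi column through UTF-8 and the English column through ASCII with `errors='ignore'` drops any characters that cannot be encoded, such as non-ASCII symbols in the English sentences.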

In [None]:
data['hindi']=data['hindi'].str.encode('utf-8',errors='ignore').str.decode('utf-8')
data['english']=data['english'].str.encode('ascii',errors='ignore').str.decode('utf-8')
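
To make the effect of `errors='ignore'` concrete, here is a plain-Python sketch of the same round trip on single strings. The helper name `to_ascii` and the sample strings are assumptions for illustration.

```python
def to_ascii(text):
    # characters that cannot be represented in ASCII are silently dropped
    return text.encode("ascii", errors="ignore").decode("ascii")

def utf8_round_trip(text):
    # valid Unicode text survives a UTF-8 round trip unchanged
    return text.encode("utf-8", errors="ignore").decode("utf-8")

print(to_ascii("café"))
print(repr(to_ascii("हिन्दी text")))
print(utf8_round_trip("हिन्दी"))
```

Because characters are dropped without any warning, an English sentence made up only of non-ASCII characters would end up empty.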

**Dropping any row having NULL values**

In [None]:
# collect the index labels of all rows that contain at least one null value
null_indices=data.index[data.isnull().any(axis=1)]

In [None]:
data.drop(null_indices,inplace=True)

**Saving the processed Dataframe**

We will save this processed dataset into drive, so we don't have to repeat these steps again and again.

In [None]:
data.to_csv("/content/drive/MyDrive/Temp/Hindi3/processed.csv")

# Loading processed Dataset and Vocabulary building

We will load the processed dataframe directly and build the vocabulary for source and target language.

### References:
1. https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
2. https://www.youtube.com/watch?v=B8g-PNT2W2Q
3. https://www.youtube.com/watch?v=EoGUlvhRYpk

In [None]:
data=pd.read_csv('/content/drive/MyDrive/Temp/Hindi3/processed.csv')

In [None]:
# cells that became empty strings during cleaning are read back from the CSV as NaN,
# so the null check is repeated after loading
null_indices=data.index[data.isnull().any(axis=1)]

data.drop(null_indices,inplace=True)

In [None]:
# We need to create dictionaries for both Hindi and English languages. We will also assign
# indexes to all the words in vocabulary.
SOS_token=0
EOS_token=1
PAD_token=2
MAX_LENGTH=10

class Vocab_class:
    def __init__(self):
        self.word_to_index={"<SOS>":0,"<EOS>":1,"<PAD>":2,"<UKN>":3}  #dict to map each token to an index
        self.word_counts={} # will keep track of each token in the vocabulary
        self.index_to_word={0:"<SOS>", 1:"<EOS>", 2:"<PAD>", 3:"<UKN>"} # will map each index to a token
        self.num_of_words=4

    def sentence_add(self, sentence): # function to add the words of a sentence into the vocabulary
        words=sentence.split(" ")
        for word in words:
            if word not in self.word_to_index:
                self.word_to_index[word]=self.num_of_words
                self.word_counts[word]=1

                self.index_to_word[self.num_of_words]=word
                self.num_of_words+=1
            else:
                self.word_counts[word]+=1
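
The class above records `word_counts` but keeps every word. One way such counts could be used, sketched here as an assumption rather than something this notebook does, is to keep only words seen at least `min_count` times and map everything else to `<UKN>`. The names `build_vocab`, `encode`, and `min_count` are hypothetical.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    # count every word, then keep only those seen at least min_count times
    counts = Counter(word for s in sentences for word in s.split())
    word_to_index = {"<SOS>": 0, "<EOS>": 1, "<PAD>": 2, "<UKN>": 3}
    for word, count in counts.items():
        if count >= min_count:
            word_to_index[word] = len(word_to_index)
    return word_to_index

def encode(sentence, word_to_index):
    # unknown or rare words fall back to the <UKN> index
    unk = word_to_index["<UKN>"]
    return [word_to_index.get(word, unk) for word in sentence.split()]

vocab = build_vocab(["i am here", "i am late", "you are here"])
ids = encode("i am tired", vocab)
print(ids)
```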

In [None]:
hindi_lang=Vocab_class()
eng_lang=Vocab_class()
pairs=[]  # this list will contain sentences after it has been added to vocabulary.
for index,row in data.iterrows():
    if row['len_hindi']<MAX_LENGTH and row['len_english']<MAX_LENGTH: # taking sentences only whose length is less than the MAX_LENGTH
        pair=[row['hindi'].strip(), row['english'].strip()]
        hin_extra=MAX_LENGTH-len(row['hindi'].strip().split(" "))
        eng_extra=MAX_LENGTH-len(row['english'].strip().split(" "))
        hindi_lang.sentence_add(pair[0])
        eng_lang.sentence_add(pair[1])
        pair[0]=pair[0].split(" ")
        pair[0].insert(0,"<SOS>")
        pair[0].append("<EOS>")
        pair[0]=pair[0]+["<PAD>"]*(hin_extra)

        pair[1]=pair[1].split(" ")
        pair[1].insert(0,"<SOS>")
        pair[1].append("<EOS>")
        pair[1]=pair[1]+["<PAD>"]*(eng_extra)

        # pair[1]=pair[1].split(" ")+["PAD"]*eng_extra
        pair[0]=" ".join(pair[0])
        pair[1]=" ".join(pair[1])
        pairs.append(pair)    #padded sentence pair is added
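
The padding scheme used in the loop above can be sketched for a single sentence as follows. For a sentence of at most `max_length` words, the result always has `max_length + 2` tokens: `<SOS>`, the words, `<EOS>`, then enough `<PAD>` tokens to fill up. The helper name `pad_tokens` is hypothetical.

```python
def pad_tokens(sentence, max_length=10):
    words = sentence.strip().split(" ")
    # one <PAD> for every word the sentence is short of max_length
    padding = ["<PAD>"] * (max_length - len(words))
    return ["<SOS>"] + words + ["<EOS>"] + padding

padded = pad_tokens("i have nothing to do with them")
print(padded)
print(len(padded))
```

A fixed length is what allows the per-sentence tensors to be concatenated into batches later on.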

In [None]:
# checking how many words are present in both of the vocabularies
print(f"Hindi vocabulary size : {hindi_lang.num_of_words}")
print(f"English vocabulary size: {eng_lang.num_of_words}")

Hindi vocabulary size : 19757
English vocabulary size: 14747


In [None]:
joblib.dump(hindi_lang, '/content/drive/MyDrive/Temp/Hindi3/Model/hindi_lang.joblib')
joblib.dump(eng_lang, '/content/drive/MyDrive/Temp/Hindi3/Model/eng_lang.joblib')

# also keep pickle copies; the with-blocks make sure the files are closed after writing
with open('/content/drive/MyDrive/Temp/Hindi3/Model/hindi_lang.pkl', 'wb') as f:
    pickle.dump(hindi_lang, f)
with open('/content/drive/MyDrive/Temp/Hindi3/Model/eng_lang.pkl', 'wb') as f:
    pickle.dump(eng_lang, f)

In [None]:
def pair_to_tensor(pair):
    '''
    A function to convert a given pair to tensors corresponding to index in vocabulary
    '''
    hindi_sentence=pair[0]
    eng_sentence=pair[1]
    indexes_hindi=[hindi_lang.word_to_index[word] for word in hindi_sentence.split(' ')]
    indexes_eng=[eng_lang.word_to_index[word] for word in eng_sentence.split(' ')]
    hindi_tensor=torch.tensor(indexes_hindi, dtype=torch.long, device=device).view(-1,1)
    eng_tensor=torch.tensor(indexes_eng, dtype=torch.long, device=device).view(-1,1)
    return (hindi_tensor, eng_tensor)

In [None]:
hin_tensors=[]
eng_tensors=[]
for pair in pairs:  #will convert each pair into tensors for further processing
    hin,eng=pair_to_tensor(pair)
    hin_tensors.append(hin)
    eng_tensors.append(eng)

# Seq2Seq Model Implementation

#### References:
1. https://machinelearningmastery.com/encoder-decoder-recurrent-neural-network-models-neural-machine-translation/
2. https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346


### Encoder RNN Implementation

In [None]:
### Encoder RNN Implementation
class EncoderLSTM(nn.Module):
    def __init__(self,size_input,size_embbeding,size_hidden,layers,p):
        super(EncoderLSTM,self).__init__()
        self.size_input=size_input
        self.size_embbeding=size_embbeding
        self.size_hidden=size_hidden
        self.layers=layers
        self.dropout=nn.Dropout(p)
        self.tag=True

        self.embbed_layer=nn.Embedding(self.size_input,self.size_embbeding)
        self.lstm=nn.LSTM(self.size_embbeding,self.size_hidden,self.layers,dropout=p)

    def forward(self, x):
        # print(x.shape)
        embbeding=self.dropout(self.embbed_layer(x))
        # print(embbeding.shape)
        output, (hidden_st,cell_st) = self.lstm(embbeding)
        return hidden_st, cell_st # will return only hidden and cell state from the Encoder

### Decoder RNN Implementation


In [None]:
### Decoder RNN Implementation
class DecoderLSTM(nn.Module):
    def __init__(self,size_input,size_embbeding,size_hidden,layers,p,size_output):
        super(DecoderLSTM,self).__init__()
        self.size_input=size_input
        self.size_embbeding=size_embbeding
        self.size_hidden=size_hidden
        self.layers=layers
        self.size_output=size_output
        self.dropout=nn.Dropout(p)
        # self.tag=True

        self.embbed_layer=nn.Embedding(self.size_input,self.size_embbeding) # input_size X embedding_size
        self.lstm=nn.LSTM(self.size_embbeding,self.size_hidden,self.layers,dropout=p) #embedding_size * hidden_size
        self.fc=nn.Linear(self.size_hidden,self.size_output) # hidden_size*output_size

    def forward(self,x,hidden_st,cell_st):
        x=x.unsqueeze(0)
        embbeding=self.dropout(self.embbed_layer(x))
        outputs, (hidden_st, cell_st) = self.lstm(embbeding, (hidden_st,cell_st))
        preds=self.fc(outputs)
        preds=preds.squeeze(0)
        return preds,hidden_st,cell_st

### Encoder-Decoder Interface Implementation

In [None]:
class Seq2seq_model(nn.Module):
    def __init__(self,encoder_net,decoder_net):
        super(Seq2seq_model,self).__init__()
        self.encoder_net=encoder_net
        self.decoder_net=decoder_net

    def forward(self,src,target,teacher_forcing=0.5):
        batch_length=src.shape[1]
        target_len=target.shape[0]
        target_vocab_len=self.decoder_net.size_output  # size of the target (English) vocabulary

        output_tensor=torch.zeros(target_len,batch_length,target_vocab_len).to(device)
        hidden_st, cell_st=self.encoder_net(src)  # encoder's final states initialize the decoder
        x=target[0]  # first decoder input is the <SOS> token

        for i in range(1,target_len):
            # pass the decoder its own updated states so that context carries across time steps
            output,hidden_st,cell_st=self.decoder_net(x,hidden_st,cell_st)
            output_tensor[i]=output
            pred=output.argmax(1)
            x=target[i] if random.random()<teacher_forcing else pred #teacher forcing is used with probability teacher_forcing (default 0.5)

        return output_tensor
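
A framework-free sketch of a decoding loop with teacher forcing. At each step the next input is either the ground-truth token or the model's own prediction, and the recurrent state returned by one step is passed into the next. `toy_step` is a hypothetical stand-in for a decoder and is not part of this notebook.

```python
import random

def decode_with_teacher_forcing(step, state, target, teacher_forcing=0.5, rng=random):
    """step(token, state) must return (predicted_token, new_state)."""
    outputs = []
    x = target[0]                      # start from the first target token (<SOS>)
    for i in range(1, len(target)):
        pred, state = step(x, state)   # the updated state feeds the next step
        outputs.append(pred)
        # with probability teacher_forcing use the true token, else the prediction
        x = target[i] if rng.random() < teacher_forcing else pred
    return outputs

def toy_step(token, state):
    # hypothetical decoder: predicts token + 1 and counts its calls in the state
    return token + 1, state + 1

always_forced = decode_with_teacher_forcing(toy_step, 0, [0, 5, 9], teacher_forcing=1.0)
never_forced = decode_with_teacher_forcing(toy_step, 0, [0, 5, 9], teacher_forcing=0.0)
print(always_forced, never_forced)
```

With a ratio of 1.0 the second step sees the true token 5; with 0.0 it sees the model's own previous output.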

**Creating objects of Encoder, Decoder and Seq2Seq model**

In [None]:
encoder_ip_size=hindi_lang.num_of_words #equal to hindi vocab size
encoder_embbeding_size=400  #encoder embedding size, tried various values to finalize this value
encoder_hidden_size=512 # previously tried with 1024 but this value gives better score
encoder_layers=1  # LSTM layers=1
encoder_dropout=float(0.5)  # dropout if applied, for layers=1 no need of dropout

encoder_obj=EncoderLSTM(encoder_ip_size, encoder_embbeding_size, encoder_hidden_size, encoder_layers,
                        encoder_dropout).to(device) # creating object of EncoderLSTM class

print(encoder_obj)

EncoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embbed_layer): Embedding(19757, 400)
  (lstm): LSTM(400, 512, dropout=0.5)
)




In [None]:
# embedding size, hidden size and number of layers are the same as the encoder's
decoder_ip_size=eng_lang.num_of_words
decoder_embbed_size=400
decoder_hidden_size=512
decoder_layers=1
decoder_dropout=float(0.5)
decoder_op_size=eng_lang.num_of_words

decoder_obj=DecoderLSTM(decoder_ip_size,decoder_embbed_size,decoder_hidden_size,
                        decoder_layers, decoder_dropout, decoder_op_size).to(device)  # creating object of DecoderLSTM

print(decoder_obj)

DecoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embbed_layer): Embedding(14747, 400)
  (lstm): LSTM(400, 512, dropout=0.5)
  (fc): Linear(in_features=512, out_features=14747, bias=True)
)


In [None]:
model=Seq2seq_model(encoder_obj, decoder_obj)
print(model)

Seq2seq_model(
  (encoder_net): EncoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embbed_layer): Embedding(19757, 400)
    (lstm): LSTM(400, 512, dropout=0.5)
  )
  (decoder_net): DecoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embbed_layer): Embedding(14747, 400)
    (lstm): LSTM(400, 512, dropout=0.5)
    (fc): Linear(in_features=512, out_features=14747, bias=True)
  )
)


In [None]:
joblib.dump(model, '/content/drive/MyDrive/Temp/Hindi3/Model/model.joblib')
# the with-block makes sure the file is closed after writing
with open('/content/drive/MyDrive/Temp/Hindi3/Model/model.pkl', 'wb') as f:
    pickle.dump(model, f)

## Setting up and training the model

**Note:** To train the model again, please set train_model=True

In [None]:
batch_size=64
optimizer=optim.Adagrad(model.parameters(),lr=0.005)  #lowered the learning rate for better convergence
PATH="/content/drive/MyDrive/Temp/Hindi3/phase2_v2.pth"

#model was trained for 90 epochs but due to session length limit it was trained in 3 steps of 30 epochs each
epochs=30
epoch_loss=0.0
padding_idx=eng_lang.word_to_index["<PAD>"]
criterion=nn.CrossEntropyLoss(ignore_index=padding_idx) #ignore padding index while calculating loss

train_model=True #if need to train the model again, set it to True
model_available=False

if train_model==False:
    model=torch.load(PATH)
else:
    if model_available:
        model=torch.load(PATH)
    batches=len(pairs)//batch_size
    for epoch in range(epochs):
        print(f"epoch {epoch+1}/{epochs}")
        model.train(True)  # training mode: dropout is active
        cur_batch=0
        for idx in range(0,len(pairs),batch_size):
            cur_batch+=1
            if(cur_batch%100==0):
                print(f"    running batch {cur_batch} of {batches}")
            if idx+batch_size < len(pairs):
                src_batch=hin_tensors[idx:idx+batch_size]
                target_batch=eng_tensors[idx:idx+batch_size]
            else:
                src_batch=hin_tensors[idx:]
                target_batch=eng_tensors[idx:]

            src_batch=torch.cat(src_batch,dim=1)     #max_len*batch_size
            target_batch=torch.cat(target_batch,dim=1)

            output=model(src_batch,target_batch)
            output=output[1:].reshape(-1,output.shape[2])
            target=target_batch[1:].reshape(-1)

            optimizer.zero_grad()
            loss=criterion(output,target)

            loss.backward()
            # restrict gradients from exploding
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

            optimizer.step()
            epoch_loss += loss.item()

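        # loss.item() is the loss of the final batch of this epoch;
        # epoch_loss holds the running sum over all batches seen so far
        print(f"Epoch loss : {loss.item()}")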

        torch.save(model,PATH)
        model_available=True

epoch 1/30
    running batch 100 of 854
    running batch 200 of 854
    running batch 300 of 854
    running batch 400 of 854
    running batch 500 of 854
    running batch 600 of 854
    running batch 700 of 854
    running batch 800 of 854
Epoch loss : 5.173855304718018
epoch 2/30
    running batch 100 of 854
    running batch 200 of 854
    running batch 300 of 854
    running batch 400 of 854
    running batch 500 of 854
    running batch 600 of 854
    running batch 700 of 854
    running batch 800 of 854
Epoch loss : 4.801079273223877
epoch 3/30
    running batch 100 of 854
    running batch 200 of 854
    running batch 300 of 854
    running batch 400 of 854
    running batch 500 of 854
    running batch 600 of 854
    running batch 700 of 854
    running batch 800 of 854
Epoch loss : 4.593860149383545
epoch 4/30
    running batch 100 of 854
    running batch 200 of 854
    running batch 300 of 854
    running batch 400 of 854
    running batch 500 of 854
    running batch 600 
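
The training loop clips gradients with `clip_grad_norm_` and `max_norm=1`. The idea behind norm clipping can be sketched in plain Python on a flat list of gradient values: if the overall L2 norm exceeds `max_norm`, every value is scaled down by the same factor. The function name `clip_by_global_norm` is hypothetical, and the small `eps` term is included here to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    # L2 norm over all gradient values taken together
    total_norm = math.sqrt(sum(g * g for g in grads))
    # scale factor is capped at 1.0, so small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Since all values are scaled by one common factor, the direction of the update is preserved and only its length changes.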

Checking translation of 50 sentences from the training set.

In [None]:
test_sentences=[pair[0] for pair in pairs[50:100]]
actual_sentences=[pair[1] for pair in pairs[50:100]]
pred_sentences=[]

for idx,i in enumerate(test_sentences):
    # print(i)
    translated=predict_translation(model,i,device)
    print("*"*20)
    print(f"Hindi: {i}")
    print(f"Actual: {actual_sentences[idx]}")
    print(f"Predicted: {translated}")
    print("*"*20)

# Generating Validation Set results

In [None]:
import string
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import numpy as np
import pandas as pd

import random
import joblib
import pickle
device = torch.device("cpu")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
### Encoder RNN Implementation
class EncoderLSTM(nn.Module):
    def __init__(self,size_input,size_embbeding,size_hidden,layers,p):
        super(EncoderLSTM,self).__init__()
        self.size_input=size_input
        self.size_embbeding=size_embbeding
        self.size_hidden=size_hidden
        self.layers=layers
        self.dropout=nn.Dropout(p)
        self.tag=True

        self.embbed_layer=nn.Embedding(self.size_input,self.size_embbeding)
        self.lstm=nn.LSTM(self.size_embbeding,self.size_hidden,self.layers,dropout=p)

    def forward(self, x):
        # print(x.shape)
        embbeding=self.dropout(self.embbed_layer(x))
        # print(embbeding.shape)
        output, (hidden_st,cell_st) = self.lstm(embbeding)
        return hidden_st, cell_st # will return only hidden and cell state from the Encoder

In [None]:
### Decoder RNN Implementation
class DecoderLSTM(nn.Module):
    def __init__(self,size_input,size_embbeding,size_hidden,layers,p,size_output):
        super(DecoderLSTM,self).__init__()
        self.size_input=size_input
        self.size_embbeding=size_embbeding
        self.size_hidden=size_hidden
        self.layers=layers
        self.size_output=size_output
        self.dropout=nn.Dropout(p)
        # self.tag=True

        self.embbed_layer=nn.Embedding(self.size_input,self.size_embbeding) # input_size X embedding_size
        self.lstm=nn.LSTM(self.size_embbeding,self.size_hidden,self.layers,dropout=p) #embedding_size * hidden_size
        self.fc=nn.Linear(self.size_hidden,self.size_output) # hidden_size*output_size

    def forward(self,x,hidden_st,cell_st):
        x=x.unsqueeze(0)
        embbeding=self.dropout(self.embbed_layer(x))
        outputs, (hidden_st, cell_st) = self.lstm(embbeding, (hidden_st,cell_st))
        preds=self.fc(outputs)
        preds=preds.squeeze(0)
        return preds,hidden_st,cell_st

In [None]:
class Seq2seq_model(nn.Module):
    def __init__(self,encoder_net,decoder_net):
        super(Seq2seq_model,self).__init__()
        self.encoder_net=encoder_net
        self.decoder_net=decoder_net

    def forward(self,src,target,teacher_forcing=0.5):
        batch_length=src.shape[1]
        target_len=target.shape[0]
        target_vocab_len=self.decoder_net.size_output  # size of the target (English) vocabulary

        output_tensor=torch.zeros(target_len,batch_length,target_vocab_len).to(device)
        hidden_st, cell_st=self.encoder_net(src)  # encoder's final states initialize the decoder
        x=target[0]  # first decoder input is the <SOS> token

        for i in range(1,target_len):
            # pass the decoder its own updated states so that context carries across time steps
            output,hidden_st,cell_st=self.decoder_net(x,hidden_st,cell_st)
            output_tensor[i]=output
            pred=output.argmax(1)
            x=target[i] if random.random()<teacher_forcing else pred #teacher forcing is used with probability teacher_forcing (default 0.5)

        return output_tensor

In [None]:
SOS_token=0
EOS_token=1
PAD_token=2
MAX_LENGTH=10
def clean_sentence(sentence):
    punctuations=list(string.punctuation)
    cleaned=""
    for letter in sentence:
        if letter=='<' or letter=='>' or letter not in punctuations:
            cleaned+=letter
    return cleaned

def predict_translation(model,sentence,device,max_length=MAX_LENGTH):
    model.eval()  # inference mode: disables dropout so predictions are deterministic
    sentence=clean_sentence(sentence)
    tokens=sentence.split(" ")
    indexes=[]
    for token in tokens:
        if token in hindi_lang.word_to_index:
            indexes.append(hindi_lang.word_to_index[token])
        else:
            indexes.append(hindi_lang.word_to_index["<UKN>"])
    tensor_of_sentence=torch.LongTensor(indexes).unsqueeze(1).to(device)
    with torch.no_grad():
        hidden,cell=model.encoder_net(tensor_of_sentence)
    outputs=[SOS_token]
    for _ in range(max_length):
        prev_word=torch.LongTensor([outputs[-1]]).to(device)
        with torch.no_grad():
            output,hidden,cell=model.decoder_net(prev_word, hidden,cell)
            pred=output.argmax(1).item()

        outputs.append(pred)

        if eng_lang.index_to_word[pred] =="<EOS>":
            break

    final=[]

    for i in outputs:
        if i == PAD_token:  # outputs holds integer indexes, so compare with the index of <PAD>
            break
        final.append(i)

    final = [eng_lang.index_to_word[idx] for idx in final]
    translated=" ".join(final)
    return translated
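
The decoding loop in `predict_translation` is a greedy search: at each step it takes the highest-scoring token and feeds it back in, until `<EOS>` or the length limit is reached. A framework-free sketch of that loop is shown here. `toy_step` and its scripted scores are hypothetical stand-ins for the decoder.

```python
def greedy_decode(step, state, sos=0, eos=1, max_length=10):
    """step(prev_token, state) must return (scores, new_state)."""
    outputs = [sos]
    for _ in range(max_length):
        scores, state = step(outputs[-1], state)
        # argmax over the vocabulary scores
        pred = max(range(len(scores)), key=scores.__getitem__)
        outputs.append(pred)
        if pred == eos:
            break
    return outputs

def toy_step(prev_token, state):
    # hypothetical scorer over a 5-token vocabulary: 0 -> 3 -> 4 -> 1 (<EOS>)
    script = {0: 3, 3: 4, 4: 1}
    scores = [0.0] * 5
    scores[script[prev_token]] = 1.0
    return scores, state

full = greedy_decode(toy_step, None)
cut_short = greedy_decode(toy_step, None, max_length=2)
print(full, cut_short)
```

When the length limit is hit first, the output has no `<EOS>` at the end, which matters for any code that strips the last token unconditionally.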

In [None]:
model=joblib.load('/content/drive/MyDrive/Temp/Hindi3/Model/model.joblib')

In [None]:
model=torch.load("/content/drive/MyDrive/Temp/Hindi3/phase2_v2.pth")

In [None]:
#model = pickle.load(open('/content/drive/MyDrive/Temp/Hindi3/Model/model.pkl', 'rb'))
#eng_lang = pickle.load(open('/content/drive/MyDrive/Temp/Hindi3/Model/eng_lang.pkl', 'rb'))
#hindi_lang = pickle.load(open('/content/drive/MyDrive/Temp/Hindi3/Model/hindi_lang.pkl', 'rb'))

In [None]:
# We need to create dictionaries for both Hindi and English languages. We will also assign
# indexes to all the words in vocabulary.
SOS_token=0
EOS_token=1
PAD_token=2
MAX_LENGTH=10

class Vocab_class:
    def __init__(self):
        self.word_to_index={"<SOS>":0,"<EOS>":1,"<PAD>":2,"<UKN>":3}  #dict to map each token to an index
        self.word_counts={} # will keep track of each token in the vocabulary
        self.index_to_word={0:"<SOS>", 1:"<EOS>", 2:"<PAD>", 3:"<UKN>"} # will map each index to a token
        self.num_of_words=4

    def sentence_add(self, sentence): # function to add the words of a sentence into the vocabulary
        words=sentence.split(" ")
        for word in words:
            if word not in self.word_to_index:
                self.word_to_index[word]=self.num_of_words
                self.word_counts[word]=1

                self.index_to_word[self.num_of_words]=word
                self.num_of_words+=1
            else:
                self.word_counts[word]+=1

In [None]:
hindi_lang=joblib.load('/content/drive/MyDrive/Temp/Hindi3/Model/hindi_lang.joblib')
eng_lang=joblib.load('/content/drive/MyDrive/Temp/Hindi3/Model/eng_lang.joblib')
val_data=pd.read_csv("/content/drive/MyDrive/Temp/Hindi3/test-statements-phase2.csv")
val_data.head()
sentences=val_data['hindi']
sentences=sentences.apply(lambda x : x.strip())
#sentences=["i just want to be your highness head on"]
fp=open("/content/drive/MyDrive/Temp/Hindi3/answer_week2_v11.txt","w")

In [None]:
count=1
sentence_list=[]
translated_list=[]
for sentence in sentences:
    translated=predict_translation(model,sentence,device)
    translated=translated.split(" ")[1:-1]
    translated=" ".join(translated)
    fp.write(translated+'\n')
    sentence_list.append(sentence)
    translated_list.append(translated)
    print("sentence   : "+sentence)
    print("translated : "+translated)
    print(f"sentence : {count}")
    count+=1
fp.close()

In [None]:
valdata=data.tail(3000)
valdata.head()

Unnamed: 0,hindi,english,len_hindi,len_english,is_mixed
99320,लगभग एक snuggie के लिए कहता है,almost calls for a snuggie,7,5,True
99321,हम सुरक्षा और गबन के रखरखाव के लिए यहाँ हैं,were here for security and drone maintenance,10,7,False
99322,और ऐसा इसलिए नहीं है कि मैं उस विद्यार्थी से ब...,this is not because i am a better person than ...,30,26,False
99323,ये एक अद्भुत आकृति है लेकिन हमारे पास उतने आंक...,so theres an interesting pattern but we dont h...,20,17,False
99324,और वाकई बहुत से लोगों के दिल में यह बात घर कर ...,and certainly lots of people have taken to hea...,24,19,False


In [None]:
sentences_hi=valdata['hindi']
sentences_en=valdata['english']

In [None]:
sentences_en.head()

99320                           almost calls for a snuggie
99321         were here for security and drone maintenance
99322    this is not because i am a better person than ...
99323    so theres an interesting pattern but we dont h...
99324    and certainly lots of people have taken to hea...
Name: english, dtype: object

In [None]:
count=1
sentence_en_list=[]
translated_list=[]
# both Series have the same length, so the built-in zip pairs them up
for (sentence, sentence_en) in zip(sentences_hi, sentences_en):
    translated=predict_translation(model,sentence,device)
    translated=translated.split(" ")[1:-1]  # drop the leading <SOS> and the final token (normally <EOS>)
    translated=" ".join(translated)
    sentence_en_list.append(sentence_en)
    translated_list.append(translated)
    print("sentence   : "+sentence)
    print("translated : "+translated)
    print("sentence_en : "+sentence_en)
    print(f"sentence : {count}")
    count+=1

Streaming output truncated to the last 5000 lines.
sentence   : गोराबाल 
translated : greg they dont want want to see you know
sentence_en : blond hair
sentence : 1751
sentence   : आप इससे सहमत हैं
translated : youre gonna get rid hes not gonna be a
sentence_en : so you agree with that
sentence : 1752
sentence   : और फिर भी प्रति व्यक्ति मांस की खपत पूरे इतिहास में सबसे ज्यादा है ।
translated : and there is there is not gonna be careful
sentence_en : and yet per capita meat consumption is as high as its been in recorded history
sentence : 1753
sentence   : मेरे दर्शन का तीसरा पहलू हैं आगे बढ़ते रहना
translated : they are you know the other
sentence_en : the third aspect to the philosophy is keep moving forward
sentence : 1754
sentence   : आपकी 08 00 अपने रास्ते पर है
translated : hes so not just gonna take care of course
sentence_en : your 800 am is on his way up
sentence : 1755
sentence   : ऐ
translated : laughter he didnt mean must be careful a lot
sentence_en : whoa hey

In [None]:
import pandas as pd
df1=pd.Series(sentence_en_list).to_frame()
df2=pd.Series(translated_list).to_frame()
df1.shape,df2.shape

((3000, 1), (3000, 1))

In [None]:
#sentence_en_list=[]
#translated_list=[]

In [None]:
def Splitfunction(strval):
  return strval.split()

In [None]:
df1["sentence"]=df1[0].apply(Splitfunction)
df2["translated"]=df2[0].apply(Splitfunction)

In [None]:
df1["sentence"]

0                        [almost, calls, for, a, snuggie]
1       [were, here, for, security, and, drone, mainte...
2       [this, is, not, because, i, am, a, better, per...
3       [so, theres, an, interesting, pattern, but, we...
4       [and, certainly, lots, of, people, have, taken...
                              ...                        
2995                       [were, fighting, uphill, here]
2996                           [a, year, alone, come, on]
2997             [and, you, know, my, mother, taught, us]
2998                              [since, i, was, a, boy]
2999             [whered, you, get, something, that, big]
Name: sentence, Length: 3000, dtype: object

In [None]:
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu
# Prepare the reference sentences and candidate sentences for multiple translations
#references = [['I', 'love', 'eating', 'ice', 'cream'], ['He', 'enjoys', 'eating', 'cake']]
#translations = [['I', 'love', 'eating', 'ice', 'cream'], ['He', 'likes', 'to', 'eat', 'cake']]

# corpus_bleu expects, for every hypothesis, a list of reference token lists,
# so each single reference is wrapped in its own list
references_list = [[ref] for ref in df1["sentence"]]
hypotheses = list(df2["translated"])

# Calculate BLEU score for the entire corpus (4-gram precision only)
bleu_score_corpus = corpus_bleu(references_list, hypotheses, weights=(0, 0, 0, 1))
print("Corpus BLEU Score: ", bleu_score_corpus)

bleu1 = corpus_bleu(references_list, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references_list, hypotheses, weights=(0.5, 0.5, 0, 0))
bleu3 = corpus_bleu(references_list, hypotheses, weights=(0.3, 0.3, 0.3, 0))
bleu4 = corpus_bleu(references_list, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print('BLEU-1: {:.4f}'.format(bleu1))
print('BLEU-2: {:.4f}'.format(bleu2))
print('BLEU-3: {:.4f}'.format(bleu3))
print('BLEU-4: {:.4f}'.format(bleu4))



The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Corpus BLEU Score:  2.2250738585072626e-308
BLEU-1: 0.0602
BLEU-2: 0.0016
BLEU-3: 0.0000
BLEU-4: 0.0000
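
BLEU is built on clipped ("modified") n-gram precision. The following plain-Python sketch shows the idea for one hypothesis, and also shows why the nesting of the references matters: `references` must be a list of token lists, and passing a single token list instead makes every word string get treated as a reference made of characters. The function names are hypothetical and this is not NLTK's implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    # count all contiguous n-grams in a token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(references, hypothesis, n):
    hyp_counts = ngrams(hypothesis, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # each hypothesis n-gram is credited at most as often as it occurs in a reference
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

reference = ["the", "cat", "is", "on", "the", "mat"]
hypothesis = ["the", "the", "the", "cat"]
p1 = modified_precision([reference], hypothesis, 1)   # correct nesting
p2 = modified_precision([reference], hypothesis, 2)
wrong = modified_precision(reference, hypothesis, 1)  # missing nesting
print(p1, p2, wrong)
```

The unigram "the" appears three times in the hypothesis but is credited only twice, because the reference contains it twice.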


In [None]:
!pip install jiwer
!pip install nltk
!pip install sacrebleu
!pip install sacremoses


In [None]:
!pip install evaluate

In [None]:
list(df1[0])

In [None]:
import evaluate
bleu = evaluate.load('bleu')

references=[ [li] for li in list(df1[0])]
predictions=list(df2[0])

results = bleu.compute(predictions=predictions, references=references)
print(results)

{'bleu': 0.00788679600497459, 'precisions': [0.122828978250665, 0.02003190923595107, 0.00424248619914128, 0.000603536725209729], 'brevity_penalty': 0.8852456002789404, 'length_ratio': 0.891352859135286, 'translation_length': 25564, 'reference_length': 28680}
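
The fields of the result above fit together as follows: the score is the brevity penalty multiplied by the geometric mean of the four n-gram precisions, and the brevity penalty is derived from the ratio of translation length to reference length. This is a sketch that recomputes the score from the reported numbers; the variable names are chosen here for illustration.

```python
import math

# values copied from the reported result
precisions = [0.122828978250665, 0.02003190923595107,
              0.00424248619914128, 0.000603536725209729]
translation_length = 25564
reference_length = 28680

# translations shorter than the references are penalized
ratio = translation_length / reference_length
brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1 - 1 / ratio)

# geometric mean of the 1- to 4-gram precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu = brevity_penalty * geo_mean
print(brevity_penalty, bleu)
```

The very small 3-gram and 4-gram precisions dominate the geometric mean, which is why the overall score stays below 0.01 even though about 12% of the unigrams match.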
