In [604]:
### Record the packages used and their versions in requirements.txt
# !pip install pipreqsnb
!pipreqsnb --force ./

pipreqs  --force ./


INFO: Not scanning for jupyter notebooks.
Please, verify manually the final list of requirements.txt to avoid possible dependency confusions.
INFO: Successfully saved requirements file in ./requirements.txt


In [7]:
# pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Data Preparation

## Loading data: English-to-Hindi translation pairs from the MUSE dataset by Meta

In [1]:
import pandas as pd
import numpy as np
import joblib

In [2]:
train_path = "data/train.txt"
test_path = "data/test.txt"
data_path = "data/en-hi.txt"

We load the text pairs into a pandas DataFrame for easy access. Reading with pandas was chosen over loading the file directly because direct loading did not recognise the Devanagari script and raised errors.

<!-- Separate train and test files were downloaded from the github repository and loaded accordingly. The train dataset contains 8704 text pairs and the test set has 2032 pairs. Sample of the dataset is also shown. Each entity of the dataframe is a string. The dataframes are then converted to dictionaries for ease of access. -->

The entire dataset contains 38216 pairs of English-Hindi translations. 80% of the data is used for training and the remaining 20% for testing.

In [5]:
data_df = pd.read_csv(data_path, sep = '\t', header = None)
data_df = data_df.dropna()
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38216 entries, 0 to 38220
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       38216 non-null  object
 1   1       38216 non-null  object
dtypes: object(2)
memory usage: 895.7+ KB
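
The index in the summary above runs to 38220 while only 38216 rows remain, so `dropna()` removed at least five rows. One possible cause, which is an assumption and has not been checked against `data/en-hi.txt`, is that pandas parses tokens such as `null` or `nan` as missing values by default. The sketch below uses a small made-up in-memory sample (not the real file) to show the effect and how `keep_default_na=False` keeps such words as ordinary strings.

```python
import io
import pandas as pd

# Made-up three-line sample in the same tab-separated layout (not taken from the real file)
sample = "and\tऔर\nnull\tशून्य\nnan\tनान\n"

# Default parsing: "null" and "nan" are read as missing values, so dropna() removes those rows
default_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)
rows_after_dropna = len(default_df.dropna())

# keep_default_na=False keeps every token as a plain string
raw_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, keep_default_na=False)

print(rows_after_dropna, len(raw_df))
```

If the dropped rows turn out to be genuinely empty lines instead, the default behaviour is the right one and no change is needed.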


In [7]:
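# .loc slices by index label, so rows whose label is <= 0.8 * len(data_df) go to the train set.
# Labels removed by dropna() are skipped, which is why the train set has 30568 rows.
train_df = data_df.loc[:len(data_df)*0.8]
test_df = data_df.loc[len(data_df)*0.8:]

# DataFrame.info() prints its summary directly and returns None,
# which is why "None" follows each label in the output below.
print("Train Data Info", train_df.info())
print("Test Data Info", test_df.info())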

<class 'pandas.core.frame.DataFrame'>
Index: 30568 entries, 0 to 30572
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       30568 non-null  object
 1   1       30568 non-null  object
dtypes: object(2)
memory usage: 716.4+ KB
Train Data Info None
<class 'pandas.core.frame.DataFrame'>
Index: 7648 entries, 30573 to 38220
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7648 non-null   object
 1   1       7648 non-null   object
dtypes: object(2)
memory usage: 179.2+ KB
Test Data Info None
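
The split above slices by index label, so its sizes depend on which labels survived `dropna()`. As an alternative, here is a minimal sketch of a position-based split with `iloc`, which cuts at an exact row count. The helper name `split_by_position` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def split_by_position(df, train_fraction=0.8):
    """Split a DataFrame by row position rather than by index label."""
    cut = int(len(df) * train_fraction)  # number of rows in the train part
    return df.iloc[:cut], df.iloc[cut:]

# Toy frame with gaps in the index, mimicking rows removed by dropna()
toy_df = pd.DataFrame({0: list("abcdefghij"), 1: list("ABCDEFGHIJ")}).drop(index=[2, 5])

train_part, test_part = split_by_position(toy_df)
print(len(train_part), len(test_part))
```

Applied to the real data this would give slightly different set sizes from those reported above, so any downstream numbers would need to be recomputed.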


In [553]:
# train_df = pd.read_csv(train_path, sep = '\t', header = None)
# print(train_df.info())

# test_df = pd.read_csv(test_path, sep = '\t', header = None)
# print(test_df.info())

In [9]:
train_df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | and | और |
| 1 | was | था |
| 2 | was | थी |
| 3 | for | लिये |
| 4 | that | उस |


## Loading pre-trained fastText monolingual embeddings

To obtain word vectors for both Hindi and English, the pre-trained embedding models are downloaded from [here](https://fasttext.cc/docs/en/pretrained-vectors.html). Word vectors are generated for every word in the dataset, covering both the train and test sets, and stored in two separate lists for later reuse.

In [11]:
# !pip install fasttext-wheel
import fasttext
import fasttext.util

In [13]:
hindi_embed = fasttext.load_model('Embeddings/wiki.hi.bin')     ### Embedding model for Hindi words
english_embed = fasttext.load_model('Embeddings/wiki.en.bin')   ### Embedding model for English words

In [14]:
english_words = list(train_df[0]) + list(test_df[0])            ### All the English words in corpus
hindi_words = list(train_df[1]) + list(test_df[1])              ### All the Hindi words in the corpus

In [15]:
print(len(english_words), len(hindi_words))

38216 38216


In [48]:
### Use of dictionaries is discarded to preserve the duplicate entries
# english_vectors = {word : english_embed.get_word_vector(word) for word in english_words}
# hindi_vectors = {word: hindi_embed.get_word_vector(word) for word in hindi_words}    
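
To illustrate why a dictionary loses duplicate entries, the sketch below compares a dict comprehension with a list comprehension on a few pairs from the sample shown earlier. `toy_vector` is a hypothetical stand-in for `get_word_vector`, used only so the example runs without the embedding models.

```python
import numpy as np

def toy_vector(word):
    # Hypothetical stand-in for a fastText lookup; returns a dummy 3-dimensional vector
    return np.full(3, float(len(word)))

sample_english = ["was", "was", "for"]   # "was" appears twice, with two Hindi translations
sample_hindi = ["था", "थी", "लिये"]

as_dict = {word: toy_vector(word) for word in sample_english}  # duplicate keys collapse
as_list = [toy_vector(word) for word in sample_english]        # one vector per pair

print(len(as_dict), len(as_list))
```

Only the list stays aligned index-for-index with the Hindi side, which the later matrix construction relies on.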

In [44]:
### To load the word vectors. To create the embedding word vectors from scratch, run the commented out lines in this cell

english_vectors = joblib.load("vars/english_vectors")
hindi_vectors = joblib.load("vars/hindi_vectors")

# english_vectors = [english_embed.get_word_vector(word) for word in english_words]
# hindi_vectors = [hindi_embed.get_word_vector(word) for word in hindi_words]

In [46]:
### To store the list of word vectors

# joblib.dump(english_vectors, "vars/english_vectors")
# joblib.dump(hindi_vectors, "vars/hindi_vectors")

In [188]:
### Dimension of embeddings

dim = len(list(english_vectors)[0])
dim

300

# Embedding Alignment: Applying the orthogonal Procrustes method

To solve the Procrustes alignment, the word vectors of the train set are arranged into matrices of shape (*numSamples, embedding dimension*), i.e. (30568, 300) in our case.

The Procrustes method uses Singular Value Decomposition to compute the orthogonal transformation that best aligns the source matrix with the target matrix. Once the solution **W** is computed, it is applied to the embedding of an English word ($x_s$) to obtain the transformed vector ($W \cdot x_s$). According to [Conneau et al.](https://arxiv.org/pdf/1710.04087), the translation is the Hindi word $t$ whose embedding $y_t$ has the highest cosine similarity with the transformed source embedding. In the code below the embeddings are stored as rows, so the mapping is applied as $x_s W$.

$$
t = \arg\max_t \cos(W \cdot x_s, y_t)
$$
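
As a minimal sketch of the SVD-based solution, assuming row-wise matrices as used in this notebook, the example below builds synthetic data from a known rotation and recovers it. The names `procrustes_svd`, `source`, `target` and `true_rotation` are illustrative and do not appear in the notebook.

```python
import numpy as np

def procrustes_svd(source, target):
    """Orthogonal W minimising ||source @ W - target||_F for (n, dim) inputs."""
    u, _, vt = np.linalg.svd(source.T @ target)  # SVD of the (dim, dim) cross-covariance
    return u @ vt

rng = np.random.default_rng(0)
n_pairs, toy_dim = 200, 5

# Random orthogonal matrix acting as the "true" mapping between the two spaces
true_rotation, _ = np.linalg.qr(rng.standard_normal((toy_dim, toy_dim)))
source = rng.standard_normal((n_pairs, toy_dim))
target = source @ true_rotation

W_toy = procrustes_svd(source, target)
print(np.allclose(W_toy, true_rotation))               # recovered mapping matches
print(np.allclose(W_toy.T @ W_toy, np.eye(toy_dim)))   # and is orthogonal
```

Note that the SVD is taken of a (dim, dim) matrix, so its cost does not grow with the number of word pairs beyond the initial matrix product.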


In [17]:
from scipy.linalg import orthogonal_procrustes
from scipy.linalg import svd

In [18]:
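def createMatrices(n, english_vectors, hindi_vectors):
    # Stack the first n word vectors into (n, dim) matrices in a single call;
    # growing a matrix with np.vstack inside a loop copies it on every iteration.
    # float64 matches the dtype produced by the earlier np.empty-based version,
    # and `dim` replaces the hard-coded 300.
    english_matrix = np.asarray(english_vectors[:n], dtype=np.float64).reshape(n, dim)
    hindi_matrix = np.asarray(hindi_vectors[:n], dtype=np.float64).reshape(n, dim)

    return english_matrix, hindi_matrix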

In [19]:
### To save computation time, precomputed matrices are loaded from disk. To recompute the matrices, uncomment the line below

# english_matrix, hindi_matrix = createMatrices(len(train_df), english_vectors, hindi_vectors)

english_matrix = joblib.load("vars/english_matrix")
hindi_matrix = joblib.load("vars/hindi_matrix")

In [20]:
### Code to store the variables

# joblib.dump(english_matrix, "vars/english_matrix")
# joblib.dump(hindi_matrix, "vars/hindi_matrix")

In [21]:
english_matrix.shape, hindi_matrix.shape

((30568, 300), (30568, 300))

In [22]:
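# A = hindi_matrix @ (english_matrix.T)
# U, s, V = svd(A, full_matrices=0)
# R = U @ V.T
# W = np.multiply((1 + 0.01), R) - np.multiply(0.01, R @ R.T) @ R

#### The lines above attempt the closed-form solution from the paper. With the row-wise
#### (30568, 300) matrices used here, hindi_matrix @ english_matrix.T is a (30568, 30568)
#### matrix, so the SVD did not complete. A (300, 300) cross-covariance such as
#### english_matrix.T @ hindi_matrix is sufficient for the Procrustes solution.

# orthogonal_procrustes returns the orthogonal W minimising ||english_matrix @ W - hindi_matrix||_F
# and a scale term (the sum of the singular values), stored in A and not used below.
W, A = orthogonal_procrustes(english_matrix, hindi_matrix)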

In [23]:
W.shape

(300, 300)
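
The commented-out update `(1 + 0.01) * R - 0.01 * (R @ R.T) @ R` in the cell above corresponds to the orthogonality-preserving step described by Conneau et al., with β = 0.01. The sketch below, on a small synthetic matrix, shows how repeating that step pulls a nearly orthogonal matrix back towards orthogonality. The helper names and the noise level are illustrative assumptions.

```python
import numpy as np

def orthogonalize_step(W, beta=0.01):
    """One update W <- (1 + beta) * W - beta * (W W^T) W."""
    return (1 + beta) * W - beta * (W @ W.T) @ W

def orthogonality_error(W):
    """Frobenius norm of W^T W - I; zero for an exactly orthogonal matrix."""
    return np.linalg.norm(W.T @ W - np.eye(W.shape[0]))

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W_noisy = Q + 0.01 * rng.standard_normal((4, 4))  # slightly perturbed orthogonal matrix

W_fixed = W_noisy
for _ in range(200):
    W_fixed = orthogonalize_step(W_fixed)

print(orthogonality_error(W_noisy), orthogonality_error(W_fixed))
```

The matrix returned by `orthogonal_procrustes` is already orthogonal up to floating-point error, so this step is not needed for the closed-form solution used here.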

# Evaluation

## Function for translating English words to Hindi

The function takes the English word to translate, the orthogonal transformation matrix from the Procrustes solution, the model that maps English words to their word vectors, and all the Hindi word vectors. It returns the Hindi word whose vector scores highest against the transformed English vector.

In [50]:
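def translateWord(english_word, R, english_model, hindi_vectors):

    # Embed the English word with the model passed in, then map it into the Hindi space
    english_vector = english_model.get_word_vector(english_word)
    translation_vector = np.dot(english_vector, R)
    # Score every Hindi vector, including the last one, by dot product with the mapped vector.
    # This is an unnormalised score, not cosine similarity.
    i = max(range(len(hindi_vectors)), key=lambda i: np.dot(hindi_vectors[i], translation_vector))

    # hindi_words is the global list aligned index-for-index with hindi_vectors
    return hindi_words[i]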

In [52]:
translateWord('november', W, english_embed, hindi_vectors) ### The translation is not very accurate

'फ़रवरी'
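
The retrieval rule quoted from Conneau et al. uses cosine similarity, whereas the function above ranks candidates by a raw dot product. The two can disagree when candidate vectors have different norms. Below is a minimal sketch on made-up two-dimensional vectors; `nearest_by_dot` and `nearest_by_cosine` are illustrative helpers, not functions from the notebook.

```python
import numpy as np

def nearest_by_dot(query, candidates):
    """Index of the candidate row with the largest dot product with the query."""
    return int(np.argmax(candidates @ query))

def nearest_by_cosine(query, candidates):
    """Index of the candidate row with the largest cosine similarity to the query."""
    unit_candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    return int(np.argmax(unit_candidates @ unit_query))

query = np.array([1.0, 0.0])
candidates = np.array([
    [0.9, 0.1],   # almost the same direction as the query, small norm
    [3.0, 3.0],   # 45 degrees away from the query, large norm
])

print(nearest_by_dot(query, candidates), nearest_by_cosine(query, candidates))
```

Whether switching to cosine similarity changes the scores reported in this notebook has not been tested here; it depends on how much the norms of the Hindi vectors vary.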

In [592]:
def calAccuracy(df):
    
    hindi_translated = []
    
    for word in df[0]:
        t = translateWord(word, W, english_embed, hindi_vectors)
        hindi_translated.append(t)
        
    trueHindi = list(df[1])
    preds = []
    
    for i in range(len(trueHindi)):
        if trueHindi[i] == hindi_translated[i]:
            preds.append(1)
        else:
            preds.append(0)
            
    return (sum(preds)/len(preds))

## Accuracy calculation

In [594]:
trainAcc = calAccuracy(train_df)
testAcc = calAccuracy(test_df)

print("The accuracy of translation by the model on the training set is {:.4f}".format(trainAcc))
print("The accuracy of translation by the model on the test set is {:.4f}".format(testAcc))

The accuracy of translation by the model on the training set is 0.1273
The accuracy of translation by the model on the test set is 0.0088


## Precision@k

The Precision@k metric measures the fraction of source words whose reference translation appears among the top k predicted translations. We are interested in Precision@1 and Precision@5. Precision@1 coincides with the accuracy computed above, since only the top prediction is taken as the translation. Precision@5 is computed below.

In [54]:
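def topTranslateWords(english_word, R, english_model, hindi_vectors, k):

    # Embed the English word with the model passed in, then map it into the Hindi space
    english_vector = english_model.get_word_vector(english_word)
    translation_vector = np.dot(english_vector, R)
    # Rank every Hindi vector, including the last one, by dot product and keep the k best
    indices = sorted(range(len(hindi_vectors)), key=lambda i: np.dot(hindi_vectors[i], translation_vector), reverse=True)[:k]

    # hindi_words is the global list aligned index-for-index with hindi_vectors
    return [hindi_words[i] for i in indices]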

In [56]:
topTranslateWords('november', W, english_embed, hindi_vectors, 5)     #### the dataset is not perfect and has english words as hindi translations

['फ़रवरी', 'फ़रवरी', 'नवम्बर', 'नवम्बर', 'दिसम्बर']

In [58]:
def calPrecision(df, k):
    topAns = []
    
    for word in df[0]:
        t = topTranslateWords(word, W, english_embed, hindi_vectors, k)
        topAns.append(t)
        
    trueHindi = list(df[1])
    preds = []
    
    for i in range(len(trueHindi)):
        if trueHindi[i] in topAns[i]:
            preds.append(1)
        else:
            preds.append(0)
            
    return (sum(preds)/len(preds))
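
For reference, the same metric can be sketched in vectorised form when all similarity scores are available as one matrix. This is an illustrative alternative on toy data, not the notebook's implementation; `precision_at_k`, the score matrix and the word lists are all made up for the example.

```python
import numpy as np

def precision_at_k(scores, candidate_words, gold_words, k):
    """Fraction of queries whose gold word is among the k highest-scoring candidates.

    scores has shape (n_queries, n_candidates); higher means more similar.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]  # indices of the k best candidates per query
    hits = [
        gold in {candidate_words[j] for j in row}
        for gold, row in zip(gold_words, top_k)
    ]
    return sum(hits) / len(hits)

toy_candidates = ["a", "b", "c", "d"]
toy_scores = np.array([
    [0.9, 0.1, 0.5, 0.2],
    [0.1, 0.2, 0.3, 0.8],
    [0.4, 0.6, 0.5, 0.1],
])
toy_gold = ["c", "a", "b"]

p_at_1 = precision_at_k(toy_scores, toy_candidates, toy_gold, 1)
p_at_2 = precision_at_k(toy_scores, toy_candidates, toy_gold, 2)
print(p_at_1, p_at_2)
```

Comparing words rather than indices mirrors the function above, which matters because the Hindi word list contains duplicate entries.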

In [60]:
trainP5 = calPrecision(train_df, 5)
testP5 = calPrecision(test_df, 5)

print("The Precision@5 of the model on the training set is {:.4f}".format(trainP5))
print("The Precision@5 of the model on the test set is {:.4f}".format(testP5))

The Precision@5 of the model on the training set is 0.2575
The Precision@5 of the model on the test set is 0.0395
