## Data Preparation

First, we download pre-trained fastText embeddings for English and Hindi, along with the English-Hindi bilingual dictionary from MUSE, and prepare the data for alignment.

In [1]:
# download pre-trained English word embeddings
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz

--2025-04-18 15:22:58--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.157.254.102, 108.157.254.121, 108.157.254.124, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.157.254.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1325960915 (1.2G) [binary/octet-stream]
Saving to: ‘cc.en.300.vec.gz’


2025-04-18 15:23:08 (129 MB/s) - ‘cc.en.300.vec.gz’ saved [1325960915/1325960915]



In [2]:
# download pre-trained Hindi word embeddings
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.vec.gz

--2025-04-18 15:23:38--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.157.254.124, 108.157.254.15, 108.157.254.102, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.157.254.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1118942272 (1.0G) [binary/octet-stream]
Saving to: ‘cc.hi.300.vec.gz’


2025-04-18 15:24:46 (15.9 MB/s) - ‘cc.hi.300.vec.gz’ saved [1118942272/1118942272]



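We load the first 100,000 embeddings for each language. The files list words in decreasing order of frequency, so these are the 100,000 most frequent English and Hindi words. Each vector is normalized to unit length as it is loaded.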

In [5]:
# load the embeddings
import gzip
import numpy as np

def load_embeddings(file_path, top_n=100000):
    embeddings = {}
    with gzip.open(file_path, 'rb') as f:
        for i, line in enumerate(f):
            if i == 0:
                continue
            if i > top_n:
                break
            tokens = line.decode('utf-8').strip().split(' ')
            word = tokens[0]
            vector = np.array(tokens[1:], dtype=np.float32)
            vector = vector / np.linalg.norm(vector)
            embeddings[word] = vector
    return embeddings

en_embeddings = load_embeddings('cc.en.300.vec.gz', top_n=100000)
hi_embeddings = load_embeddings('cc.hi.300.vec.gz', top_n=100000)

print(f"Loaded {len(en_embeddings)} English embeddings")
print(f"Loaded {len(hi_embeddings)} Hindi embeddings")

Loaded 100000 English embeddings
Loaded 100000 Hindi embeddings
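
For reference, the `.vec` files are plain text: a header line with the vocabulary size and the dimension, followed by one word and its vector components per line. The sketch below writes a tiny made-up file in that layout and parses it in the same spirit as `load_embeddings`. The file name `toy.vec.gz`, the helper `read_vec`, the words and the values are all invented for illustration.

```python
import gzip
import os
import tempfile

import numpy as np

# Made-up three-word vocabulary with 4-dimensional vectors (illustration only).
toy_lines = [
    "3 4",                   # header: vocabulary size, dimension
    "the 0.1 0.2 0.3 0.4",
    "of 0.5 0.5 0.5 0.5",
    "cat 1.0 0.0 0.0 0.0",
]

toy_path = os.path.join(tempfile.mkdtemp(), "toy.vec.gz")
with gzip.open(toy_path, "wt", encoding="utf-8") as f:
    f.write("\n".join(toy_lines) + "\n")

def read_vec(path, top_n):
    """Read the first top_n vectors and L2-normalize each one."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        _n_words, dim = map(int, f.readline().split())  # consume the header line
        for i, line in enumerate(f):
            if i >= top_n:
                break
            parts = line.rstrip().split(" ")
            vec = np.asarray(parts[1:], dtype=np.float32)
            vectors[parts[0]] = vec / np.linalg.norm(vec)
    return vectors, dim

toy_vectors, toy_dim = read_vec(toy_path, top_n=2)
print(list(toy_vectors), toy_dim)
```

Opening the archive in text mode (`"rt"`) avoids decoding each line by hand; otherwise the logic mirrors the loader above.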


In [4]:
# download english-hindi dictionary from MUSE
!wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.txt

--2025-04-18 15:25:24--  https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.157.254.124, 108.157.254.121, 108.157.254.102, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.157.254.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 930856 (909K) [text/x-c++]
Saving to: ‘en-hi.txt’


2025-04-18 15:25:26 (1.13 MB/s) - ‘en-hi.txt’ saved [930856/930856]



In [7]:
# load the English-Hindi dictionary and display the first few pairs
def load_bilingual_lexicon(file_path):
    bilingual_dict = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            en_word, hi_word = line.strip().split()
            bilingual_dict.append((en_word, hi_word))
    return bilingual_dict

en_hi_pairs = load_bilingual_lexicon('en-hi.txt')
print(en_hi_pairs[:10])

[('and', 'और'), ('was', 'था'), ('was', 'थी'), ('for', 'लिये'), ('that', 'उस'), ('that', 'कि'), ('with', 'साथ'), ('from', 'से'), ('from', 'इससे'), ('this', 'ये')]


In [8]:
# Extract word embeddings for bilingual word pairs
import numpy as np

def extract_word_embeddings(bilingual_pairs, en_embeddings, hi_embeddings):
    en_vecs = []
    hi_vecs = []

    for en_word, hi_word in bilingual_pairs:
        if en_word in en_embeddings and hi_word in hi_embeddings:
            en_vecs.append(en_embeddings[en_word])
            hi_vecs.append(hi_embeddings[hi_word])

    en_vecs = np.array(en_vecs)
    hi_vecs = np.array(hi_vecs)

    return en_vecs, hi_vecs

en_vecs, hi_vecs = extract_word_embeddings(en_hi_pairs, en_embeddings, hi_embeddings)
print(f"Extracted {en_vecs.shape[0]} aligned word vectors.")

Extracted 18972 aligned word vectors.


## Cross-Lingual Alignment

### Procrustes Alignment

We use orthogonal Procrustes alignment to learn an orthogonal matrix W that maps X to Y, where X is a NumPy array of source-language (English) word embeddings and Y is a NumPy array of the corresponding target-language (Hindi) word embeddings, so that row i of X and row i of Y form a dictionary pair.
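
Written out, the orthogonal Procrustes problem and its closed-form solution look as follows. This is the standard textbook formulation, assuming the rows of X and Y are paired and unit length, as in the code that follows. The symbols U, Σ and V denote the factors of a singular value decomposition.

```latex
\begin{align}
W^{\star} &= \arg\min_{W^{\top} W = I} \lVert X W - Y \rVert_F^{2}
           = \arg\max_{W^{\top} W = I} \operatorname{tr}\!\left( W^{\top} X^{\top} Y \right), \\
X^{\top} Y &= U \Sigma V^{\top} \quad \Longrightarrow \quad W^{\star} = U V^{\top}.
\end{align}
```

The second form of the objective follows because an orthogonal W leaves the norm of XW unchanged, so only the cross term depends on W.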

In [9]:
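def orthogonal_procrustes(X, Y):
    """Return the orthogonal matrix W that minimizes ||XW - Y||_F for row-paired X and Y."""
    # normalize each row to unit length
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    # cross-covariance between the source and target embeddings
    M = np.dot(X.T, Y)

    # the optimal orthogonal map is U @ Vt, where M = U @ diag(S) @ Vt
    U, _, Vt = np.linalg.svd(M)
    W = np.dot(U, Vt)

    return W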

W = orthogonal_procrustes(en_vecs, hi_vecs)

print("Orthogonal mapping matrix learned.")

Orthogonal mapping matrix learned.
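
A quick way to sanity-check the SVD solution is to apply it to synthetic data where the true map is known. The sketch below uses made-up random vectors, not the real embeddings, and the names `X_toy`, `Y_toy`, `Q_true` and `W_toy` are hypothetical. It rotates the source vectors by a random orthogonal matrix and checks that the same SVD recipe recovers that matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 unit-length "source" vectors in 10 dimensions.
X_toy = rng.standard_normal((200, 10))
X_toy /= np.linalg.norm(X_toy, axis=1, keepdims=True)

# A random orthogonal matrix (the Q factor of a QR decomposition) plays the true map.
Q_true, _ = np.linalg.qr(rng.standard_normal((10, 10)))
Y_toy = X_toy @ Q_true

# Same recipe as orthogonal_procrustes: SVD of X^T Y, then W = U V^T.
U, _, Vt = np.linalg.svd(X_toy.T @ Y_toy)
W_toy = U @ Vt

print("W is orthogonal:   ", np.allclose(W_toy.T @ W_toy, np.eye(10), atol=1e-8))
print("W recovers the map:", np.allclose(W_toy, Q_true, atol=1e-8))
```

On real embeddings no exact orthogonal map exists, so the learned W is only the best orthogonal fit to the dictionary pairs.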


In [10]:
# Apply learned mapping to the source language embeddings

def apply_mapping(embeddings, W):
    mapped_embeddings = {}
    for word, vec in embeddings.items():
        mapped_vec = np.dot(vec, W)
        mapped_vec = mapped_vec / np.linalg.norm(mapped_vec) # normalize the mapped vector
        mapped_embeddings[word] = mapped_vec
    return mapped_embeddings

aligned_en_embeddings = apply_mapping(en_embeddings, W)

print(f"Aligned {len(aligned_en_embeddings)} English embeddings into the Hindi space.")

Aligned 100000 English embeddings into the Hindi space.


## Evaluation

### Word Translation

We now translate a limited number of English words into Hindi by nearest-neighbour search over the aligned embeddings. I limited `en_words` to 2,000 because an earlier run with 3,000 words ran for more than 2 hours before the session crashed.


In [11]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def translate_words(aligned_en_embeddings, hi_embeddings, top_k=5, limit_size=None):
    translations = {}
    hi_words = list(hi_embeddings.keys())
    hi_vecs = np.array(list(hi_embeddings.values()))

    en_words = list(aligned_en_embeddings.keys())
    if limit_size is not None:
        en_words = en_words[:limit_size]

    # normalize the Hindi matrix once, outside the loop, instead of once per English word
    hi_vecs_norm = hi_vecs / np.linalg.norm(hi_vecs, axis=1, keepdims=True)

    for en_word in en_words:
        en_vec = aligned_en_embeddings[en_word]
        en_vec = en_vec / np.linalg.norm(en_vec)
        # cosine similarity between this English vector and every Hindi vector
        similarities = cosine_similarity([en_vec], hi_vecs_norm).flatten()
        nearest_idxs = similarities.argsort()[-top_k:][::-1] # get top_k most similar Hindi words
        nearest_words = [hi_words[i] for i in nearest_idxs]

        translations[en_word] = nearest_words

    return translations

limit_size = 2000
translations = translate_words(aligned_en_embeddings, hi_embeddings, top_k=5, limit_size=limit_size)

for en_word, hi_words in list(translations.items())[:10]:
    print(f"English: {en_word} -> Hindi: {hi_words}")


English: , -> Hindi: [',', 'और', 'कि', 'हैं', '।']
English: the -> Hindi: ['में', 'पहले', 'सबसे', 'अपने', 'जिस']
English: . -> Hindi: ['तो', 'ही', '*', 'आज', '.']
English: and -> Hindi: ['तथा', 'साथ', 'एवं', 'और', 'हैं']
English: to -> Hindi: ['करके', 'करने', 'करना', 'करें', 'करते']
English: of -> Hindi: ['में', 'प्रति', 'तथा', 'सबसे', 'अधीन']
English: a -> Hindi: ['बड़ा', 'दूसरा', 'बड़ा', 'पहला', 'छोटा']
English: </s> -> Hindi: ['▲', 'è', 'bss', 'pd', '▓']
English: in -> Hindi: ['में', 'बाहर', 'जहां', 'मे', 'लाने']
English: is -> Hindi: ['है', 'यह', 'होता', 'क्योंकि', 'जो']
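
The word-by-word search above can also be done in batches with plain matrix products, which may help with the long running time. The following is a minimal sketch, not the method used in this notebook: the helper `batched_top_k` and its `batch_size` parameter are hypothetical, and it assumes every row is already unit length so that a dot product equals cosine similarity. The data is made up.

```python
import numpy as np

def batched_top_k(src_vecs, tgt_vecs, top_k=5, batch_size=512):
    """For each source row, return indices of the top_k most similar target rows.

    Assumes the rows of both matrices are L2-normalized, so the dot product
    equals cosine similarity.
    """
    top_k = min(top_k, tgt_vecs.shape[0])
    result = np.empty((src_vecs.shape[0], top_k), dtype=np.int64)
    for start in range(0, src_vecs.shape[0], batch_size):
        sims = src_vecs[start:start + batch_size] @ tgt_vecs.T  # (batch, n_tgt)
        # argpartition selects the top_k columns per row without a full sort
        part = np.argpartition(-sims, top_k - 1, axis=1)[:, :top_k]
        # then only those top_k entries are ordered by decreasing similarity
        order = np.argsort(-np.take_along_axis(sims, part, axis=1), axis=1)
        result[start:start + batch_size] = np.take_along_axis(part, order, axis=1)
    return result

# Toy check with made-up vectors: each source row is a noisy copy of a target row.
rng = np.random.default_rng(0)
tgt = rng.standard_normal((1000, 50))
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
src = tgt[:20] + 0.01 * rng.standard_normal((20, 50))
src /= np.linalg.norm(src, axis=1, keepdims=True)

idx = batched_top_k(src, tgt, top_k=5, batch_size=8)
print(idx[:3, 0])  # nearest target index for the first three source rows
```

In this notebook, `src_vecs` could be built by stacking `aligned_en_embeddings[w]` for the selected English words and `tgt_vecs` by stacking the values of `hi_embeddings`. Both are already unit length after loading and mapping.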


### Translation accuracy and Precision using MUSE dictionary

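We compute P@1, P@5 and translation accuracy to evaluate the aligned embeddings. As implemented here, a prediction counts towards accuracy whenever the reference translation appears anywhere in the top 5, so accuracy coincides with P@5. The reference dictionary is the same `en-hi.txt` file that was used to learn the mapping, and `load_muse_test_dict` keeps only the last listed translation for each English word.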

In [12]:
def load_muse_test_dict(file_path):
    test_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            words = line.strip().split()
            if len(words) == 2:
                test_dict[words[0]] = words[1]
    return test_dict

In [13]:
def evaluate_translation(translations, test_dict, top_k=5):
    true_positives_at_1 = 0
    true_positives_at_5 = 0
    false_positives_at_1 = 0
    false_positives_at_5 = 0
    correct_predictions = 0
    total_predictions = 0

    for en_word, correct_hi_word in test_dict.items():
        predicted_hi_words = translations.get(en_word, [])

        if len(predicted_hi_words) > 0:
            total_predictions += 1

            # calculate Precision@1
            if correct_hi_word == predicted_hi_words[0]:
                true_positives_at_1 += 1
                correct_predictions += 1
            else:
                false_positives_at_1 += 1

            # calculate Precision@5
            if correct_hi_word in predicted_hi_words[:top_k]:
                true_positives_at_5 += 1

                # Only count for accuracy if it hasn't been counted for Precision@1
                if correct_hi_word != predicted_hi_words[0]:
                    correct_predictions += 1
            else:
                false_positives_at_5 += 1

    precision_at_1 = true_positives_at_1 / (true_positives_at_1 + false_positives_at_1) if (true_positives_at_1 + false_positives_at_1) > 0 else 0
    precision_at_5 = true_positives_at_5 / (true_positives_at_5 + false_positives_at_5) if (true_positives_at_5 + false_positives_at_5) > 0 else 0
    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0

    return precision_at_1, precision_at_5, accuracy

test_dict = load_muse_test_dict('en-hi.txt')
precision_at_1, precision_at_5, accuracy = evaluate_translation(translations, test_dict)

print(f"Precision@1: {precision_at_1:.4f}")
print(f"Precision@5: {precision_at_5:.4f}")
print(f"Accuracy: {accuracy:.4f}")

Precision@1: 0.3827
Precision@5: 0.6675
Accuracy: 0.6675
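
Many English words have several Hindi translations in the dictionary (for example `was` maps to both `था` and `थी`). One possible variant of the metric, sketched below under that assumption, counts a prediction as correct if any of the listed translations appears in the top k. The helper `precision_at_k` and the toy predictions are hypothetical and were not used to produce the numbers above.

```python
from collections import defaultdict

def precision_at_k(predictions, gold_pairs, k):
    """Fraction of evaluated source words with any gold translation in the top k.

    predictions: dict mapping a source word to a ranked list of candidates.
    gold_pairs:  iterable of (source, target) pairs; a source word may repeat.
    """
    gold = defaultdict(set)
    for src, tgt in gold_pairs:
        gold[src].add(tgt)

    # only score source words that actually received predictions
    evaluated = [w for w in gold if predictions.get(w)]
    if not evaluated:
        return 0.0
    hits = sum(1 for w in evaluated if gold[w] & set(predictions[w][:k]))
    return hits / len(evaluated)

# Made-up example; the pairs mirror the format printed by load_bilingual_lexicon.
toy_pairs = [("was", "था"), ("was", "थी"), ("from", "से"), ("this", "ये")]
toy_predictions = {
    "was": ["थी", "था", "है"],    # a gold translation is ranked first
    "from": ["में", "से", "को"],   # the gold translation is ranked second
    "this": ["वह", "उस", "जो"],    # no gold translation in the list
}

p_at_1 = precision_at_k(toy_predictions, toy_pairs, k=1)
p_at_2 = precision_at_k(toy_predictions, toy_pairs, k=2)
print(f"P@1 = {p_at_1:.3f}, P@2 = {p_at_2:.3f}")
```

With one hit out of three words at rank 1 and two out of three within rank 2, the toy example gives P@1 of one third and P@2 of two thirds.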


### Cosine similarity
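
Next, we compute the cosine similarity between the English and Hindi embeddings of dictionary pairs to get a quantitative view of alignment quality. Note that the call in this cell passes the original `en_embeddings` rather than `aligned_en_embeddings`, so the printed values describe the two spaces before the mapping is applied, which is consistent with them being close to zero.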

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

def compute_cosine_similarities(en_word_pairs, en_embeddings, hi_embeddings, num_pairs=50):
    similarities = {}
    count = 0

    for en_word, hi_word in en_word_pairs:
        if en_word in en_embeddings and hi_word in hi_embeddings:
            en_vec = en_embeddings[en_word]
            hi_vec = hi_embeddings[hi_word]
            en_vec = en_vec / np.linalg.norm(en_vec)
            hi_vec = hi_vec / np.linalg.norm(hi_vec)
            similarity = cosine_similarity([en_vec], [hi_vec])[0][0]
            similarities[(en_word, hi_word)] = similarity

            count += 1
            if count >= num_pairs:
                break

    return similarities

cosine_similarities = compute_cosine_similarities(en_hi_pairs, en_embeddings, hi_embeddings, num_pairs=50) # limit number of pairs to 50
for (en_word, hi_word), similarity in cosine_similarities.items():
    print(f"English: {en_word}, Hindi: {hi_word}, Similarity: {similarity:.4f}")

English: and, Hindi: और, Similarity: 0.0755
English: was, Hindi: था, Similarity: -0.0464
English: was, Hindi: थी, Similarity: 0.0072
English: for, Hindi: लिये, Similarity: -0.0317
English: that, Hindi: उस, Similarity: -0.0120
English: that, Hindi: कि, Similarity: -0.0811
English: with, Hindi: साथ, Similarity: 0.0568
English: from, Hindi: से, Similarity: 0.1069
English: from, Hindi: इससे, Similarity: 0.0309
English: this, Hindi: ये, Similarity: -0.1552
English: this, Hindi: यह, Similarity: -0.1453
English: this, Hindi: इस, Similarity: -0.2058
English: his, Hindi: उसकी, Similarity: 0.0269
English: his, Hindi: उसका, Similarity: 0.0216
English: his, Hindi: उसके, Similarity: 0.0578
English: not, Hindi: नही, Similarity: 0.0132
English: not, Hindi: नहीं, Similarity: 0.0316
English: are, Hindi: हैं, Similarity: -0.0422
English: talk, Hindi: बात, Similarity: -0.0482
English: which, Hindi: जिससे, Similarity: 0.0348
English: also, Hindi: भी, Similarity: -0.0750
English: has, Hindi: रै, Similarity
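
To see how paired similarities change once a mapping is applied, it can help to compare the mean cosine before and after on synthetic data. The sketch below uses made-up vectors rather than the real embeddings, and `paired_cosine`, `rotation` and `W_demo` are hypothetical names. The targets are a rotated, slightly noisy copy of the sources, and the mapping is learned with the same SVD recipe as `orthogonal_procrustes`.

```python
import numpy as np

def paired_cosine(A, B):
    """Row-wise cosine similarity between matching rows of A and B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum(A * B, axis=1)

# Synthetic stand-ins for paired source/target vectors (not the real embeddings).
rng = np.random.default_rng(0)
src = rng.standard_normal((500, 20))
rotation, _ = np.linalg.qr(rng.standard_normal((20, 20)))
tgt = src @ rotation + 0.05 * rng.standard_normal((500, 20))

# Learn the orthogonal map from the pairs: SVD of X^T Y, then W = U V^T.
src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
U, _, Vt = np.linalg.svd(src_n.T @ tgt_n)
W_demo = U @ Vt

before = paired_cosine(src, tgt).mean()
after = paired_cosine(src @ W_demo, tgt).mean()
print(f"mean cosine before mapping: {before:.3f}")
print(f"mean cosine after mapping:  {after:.3f}")
```

In the notebook, the analogous comparison would call `compute_cosine_similarities` once with `en_embeddings` and once with `aligned_en_embeddings`.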

### (Additional) Ablation Study

We now explore how the size of the bilingual lexicon affects the alignment of English and Hindi word embeddings and the quality of word translation.

For this, we use Procrustes alignment and evaluate the results using precision@1, precision@5, and accuracy.

In [16]:
def perform_ablation_study(en_embeddings, hi_embeddings, lexicon_sizes=(5000, 10000)):
    results = {}
    all_pairs = load_bilingual_lexicon('en-hi.txt')

    for size in lexicon_sizes:
        print(f"Performing alignment with lexicon size: {size}")
        en_hi_pairs = all_pairs[:size] # use a subset of lexicon
        en_vecs, hi_vecs = extract_word_embeddings(en_hi_pairs, en_embeddings, hi_embeddings)
        # Procrustes alignment
        W = orthogonal_procrustes(en_vecs, hi_vecs)
        # apply the learned mapping to all English word embeddings
        aligned_en_embeddings = apply_mapping(en_embeddings, W)
        translations = translate_words(aligned_en_embeddings, hi_embeddings, top_k=5,limit_size=2000)
        test_dict = load_muse_test_dict('en-hi.txt')
        precision_at_1, precision_at_5, accuracy = evaluate_translation(translations, test_dict)
        results[size] = (precision_at_1, precision_at_5, accuracy)

    return results

ablation_results = perform_ablation_study(en_embeddings, hi_embeddings)

for size, (p1, p5, acc) in ablation_results.items():
    print(f"Lexicon size: {size}")
    print(f"  Precision@1: {p1:.4f}")
    print(f"  Precision@5: {p5:.4f}")
    print(f"  Accuracy: {acc:.4f}")

Performing alignment with lexicon size: 5000
Performing alignment with lexicon size: 10000
Lexicon size: 5000
  Precision@1: 0.4121
  Precision@5: 0.7208
  Accuracy: 0.7208
Lexicon size: 10000
  Precision@1: 0.3962
  Precision@5: 0.7080
  Accuracy: 0.7080
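
Because the ablation above evaluates against the same dictionary file that supplies the training pairs, one option is to hold out part of the lexicon before learning the mapping. The sketch below is an illustration only: `split_lexicon`, `test_fraction` and `seed` are hypothetical names, and the pairs are a small made-up sample in the `(English, Hindi)` format of `en_hi_pairs`. It splits by English word so that no source word appears on both sides.

```python
import random

def split_lexicon(pairs, test_fraction=0.2, seed=0):
    """Split (source, target) pairs so that no source word is in both parts."""
    sources = sorted({src for src, _ in pairs})
    random.Random(seed).shuffle(sources)
    n_test = max(1, int(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])
    train = [p for p in pairs if p[0] not in test_sources]
    test = [p for p in pairs if p[0] in test_sources]
    return train, test

# Made-up sample in the same (English, Hindi) format as en_hi_pairs.
toy_pairs = [
    ("and", "और"), ("was", "था"), ("was", "थी"), ("for", "लिये"),
    ("that", "उस"), ("that", "कि"), ("with", "साथ"), ("from", "से"),
    ("from", "इससे"), ("this", "ये"),
]

train_pairs, test_pairs = split_lexicon(toy_pairs, test_fraction=0.3)
print(len(train_pairs), "training pairs,", len(test_pairs), "held-out pairs")
```

The training part would then feed `extract_word_embeddings` and `orthogonal_procrustes`, while the held-out part would serve as the reference for evaluation.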
