## Naive Machine Translation

In this notebook, we build a simple English-to-French translation algorithm using word embeddings and vector space models.

### Translation as Linear Transformation of Embeddings

Given dictionaries of English and French word embeddings, we can learn a transformation matrix `R` that maps the English embedding space onto the French one. For an English word embedding `e`, the product `eR` gives a new word embedding `f`, where both `e` and `f` are row vectors. We then find the nearest neighbors of `f` among the French embeddings and return the word whose embedding is most similar to `f`.
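The lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's final implementation: the helper names `cosine_similarity` and `translate`, the 2-D toy vectors, and the matrix `R` used here are all assumptions made up for the example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def translate(e, R, fr_words, fr_vectors):
    """Map an English row vector through R and return the closest French word."""
    f = e @ R  # row vector times matrix gives the transformed embedding
    scores = [cosine_similarity(f, v) for v in fr_vectors]
    return fr_words[int(np.argmax(scores))]

# Toy 2-D data, invented purely for illustration.
fr_words = ["chat", "chien"]
fr_vectors = np.array([[0.0, 1.0],
                       [1.0, 0.0]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # hypothetical mapping: swaps the two coordinates
e_cat = np.array([0.9, 0.1])      # hypothetical English embedding for "cat"

print(translate(e_cat, R, fr_words, fr_vectors))
```

Cosine similarity is used here as the similarity measure; other choices, such as Euclidean distance, would fit the same structure.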

### Problem Formulation

The goal is to find the matrix `R` that solves the following minimization problem:

$$\arg \min _{\mathbf{R}}\| \mathbf{X R} - \mathbf{Y}\|_{F}\tag{1} $$

Here `X` holds the English word embeddings and `Y` holds the French word embeddings, one word per row. We look for the transformation matrix `R` that, when applied to `X`, makes the transformed embeddings `XR` as close as possible to `Y`.
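Because equation (1) is a linear least-squares problem, one possible way to solve it on small data is a direct solver. The sketch below uses `np.linalg.lstsq` on synthetic matrices; the sizes, the random data, and the names `R_true` and `R_hat` are assumptions for illustration, and this is not necessarily the approach taken later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the embedding matrices (illustrative sizes only).
X = rng.normal(size=(50, 4))       # 50 "English" vectors of dimension 4
R_true = rng.normal(size=(4, 4))   # hypothetical ground-truth mapping
Y = X @ R_true                     # matching "French" vectors

# Least squares finds the R that minimizes ||XR - Y||_F.
R_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Value of the objective in equation (1) at the solution.
loss = np.linalg.norm(X @ R_hat - Y, ord="fro")
print(R_hat.shape, loss)
```

With real embeddings `Y` is not an exact linear image of `X`, so the loss at the minimum is generally greater than zero.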

### Frobenius Norm

The Frobenius norm of an `m × n` matrix `A` is defined as the square root of the sum of the absolute squares of its elements:

$$\|\mathbf{A}\|_{F} \equiv \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n}\left|a_{i j}\right|^{2}}\tag{2}$$

In our problem, the Frobenius norm measures the size of the difference `XR - Y` between the transformed English word embeddings and the French word embeddings. This is the quantity we minimize.
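As a quick check of equation (2), the norm can be computed by hand and compared with NumPy's built-in. The helper name `frobenius_norm` and the example matrix are assumptions chosen for illustration.

```python
import numpy as np

def frobenius_norm(A):
    """Square root of the sum of the absolute squares of all entries of A."""
    return float(np.sqrt(np.sum(np.abs(A) ** 2)))

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# 1 + 4 + 9 + 16 = 30, so the norm is sqrt(30).
print(frobenius_norm(A))
print(np.linalg.norm(A, ord="fro"))  # NumPy's built-in Frobenius norm
```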

### Word Embeddings for English and French

To translate English to French, we need word embeddings for both languages. These embeddings are vector representations of words that capture their meanings.

#### Obtaining the Data

The complete English embeddings dataset is approximately 3.64 GB, while the French embeddings dataset is around 629 MB. To avoid overloading the workspace, we use a subset of these embeddings for this assignment.

However, if you wish to utilize the full datasets on your local machine, you can obtain them as follows:

- The English embeddings can be downloaded from the Google Code Archive's word2vec page. Look for the file named [GoogleNews-vectors-negative300.bin.gz](https://code.google.com/archive/p/word2vec/). Remember to decompress the file after downloading.

- The French embeddings can be acquired from the [cross_lingual_text_classification](https://github.com/vjstark/crosslingual_text_classification) repository. You can download the file directly using the following command in your terminal:
    ```
    curl -o ./wiki.multi.fr.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec
    ```

After downloading and unzipping (if necessary) these files, you can load the embeddings into your program.
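Since the full files are large, it can help to read only the first few vectors. The sketch below assumes the word2vec text layout (a header line with the vector count and dimension, then one word and its values per line), which is the layout that gensim's `load_word2vec_format` reads for `.vec` files. The function name `load_vec_subset` and the sample data are hypothetical.

```python
import io
import numpy as np

def load_vec_subset(handle, limit):
    """Read at most `limit` vectors from a word2vec-style text stream."""
    # Header line: "<number of vectors> <dimension>"
    _count, dim = map(int, handle.readline().split())
    embeddings = {}
    for line in handle:
        if len(embeddings) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip lines that do not match the declared dimension
        embeddings[word] = np.array([float(v) for v in values], dtype=np.float32)
    return embeddings

# Tiny in-memory sample standing in for a real .vec file.
sample = io.StringIO("3 2\nle 0.1 0.2\nchat 0.3 0.4\nchien 0.5 0.6\n")
subset = load_vec_subset(sample, limit=2)
print(list(subset))
```

With a real file, the same function could be called on an open file handle, for example `open('./wiki.multi.fr.vec', encoding='utf-8')`. gensim's `load_word2vec_format` also accepts a `limit` argument that serves a similar purpose.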

In [15]:
!pip install lightning

In [28]:
import os
import pickle
import numpy as np
from gensim.models import KeyedVectors
from google_drive_downloader import download_file_from_google_drive
from lightning import LightningDataModule, LightningModule, Trainer
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm


ModuleNotFoundError: No module named 'lightning'

In [None]:
# Download English embeddings
if not os.path.exists('./GoogleNews-vectors-negative300.bin'):
    for i, chunk_size in download_file_from_google_drive('0B7XkCwpI5KDYNlNUTTlSS21pQmM', './GoogleNews-vectors-negative300.bin.gz'):
        print("Downloaded %d bytes" % (i * chunk_size))
    !gunzip GoogleNews-vectors-negative300.bin.gz

# Download French embeddings
if not os.path.exists('./wiki.multi.fr.vec'):
    !curl -o ./wiki.multi.fr.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec

# Load embeddings using gensim
en_embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
fr_embeddings = KeyedVectors.load_word2vec_format('./wiki.multi.fr.vec')
