<center> <font size = 24 color = 'steelblue'> <b>Machine Translation<br>
<img src = "https://drive.google.com/uc?export=view&id=1kDOK8t-HSazBsqscjf11ml0NFlhiQxRh" width = 600>

# <a id= 'f0'>
<font size = 4>
    
**Table of Contents:**<br>
[1. Introduction](#f1)<br>
[2. Loading libraries](#f2)<br>
[3. Loading embeddings](#f3)<br>
[4. Translating English dictionary to French](#f4)<br>
> [4.1 Working with embeddings](#f4.1)<br>
> [4.2 Computing the gradient of the loss with respect to the transform matrix R](#f4.2)<br>
[3. Cosine Similarity](#f3)<br>

##### <a id = 'f1'>
<font size = 10 color = 'midnightblue'> **Introduction**

<div class="alert alert-block alert-success">
<font size = 4>

- Machine translation involves the use of automated systems to translate text or speech from one language to another.
- NLP plays a crucial role in understanding, interpreting, and generating human language in a way that considers context and meaning.
- NLP techniques are employed to enhance the quality and accuracy of machine translation systems.
- NLP helps in addressing linguistic nuances, context understanding, and idiosyncrasies specific to each language.

##### <a id = 'f2'>
<font size = 10 color = 'midnightblue'> **Load the Libraries**

In [44]:
# %conda install -c conda-forge pickle5
%conda install -c conda-forge python-dotenv

Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/homebrew/anaconda3/envs/nlp-play

  added / updated specs:
    - python-dotenv


The following NEW packages will be INSTALLED:

  python-dotenv      conda-forge/noarch::python-dotenv-1.0.1-pyhd8ed1ab_0 



Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.


In [15]:
import nltk
import pdb
import pickle
import string
import pandas as pd
import time
import gensim
import matplotlib.pyplot as plt
import numpy as np
import scipy

from gensim.models import KeyedVectors
from nltk.corpus import stopwords, twitter_samples
from nltk.tokenize import TweetTokenizer
import re
from nltk.stem import PorterStemmer

from sklearn.metrics.pairwise import cosine_similarity

In [None]:
import os
from dotenv import load_dotenv

try:
    from google.colab import drive
    # drive.mount('/content/drive')
    load_dotenv(verbose=True, dotenv_path='.env', override=True)
    DATASET_PATH = os.getenv('COLAB_DATASET_PATH')
    # Later cells read SUPPORTING_FILES_PATH, so define it in this branch too.
    # Assumption: the supporting files sit next to the dataset unless
    # SUPPORTING_FILES_PATH is set in .env.
    SUPPORTING_FILES_PATH = os.getenv('SUPPORTING_FILES_PATH', default=DATASET_PATH)
    print("MF Running in Colab environment")
except ModuleNotFoundError:
    load_dotenv(verbose=True, dotenv_path='.env', override=True)
    DATASET_PATH = os.getenv('DATASET_PATH', default='/default/dataset/path')
    SUPPORTING_FILES_PATH = os.getenv('SUPPORTING_FILES_PATH', default='/default/supporting/dataset/path')
    print("MF Running in local environment")
    print(f"DATASET_PATH: {DATASET_PATH}")
    print(f"SUPPORTING_FILES_PATH: {SUPPORTING_FILES_PATH}")

In [17]:
from nltk.tokenize import word_tokenize

data = "This is a tweet with #hashtag and @mention! 😊"
tw = TweetTokenizer()
print("word_tokenizer-----", word_tokenize(data))
print("TweetTokenizer-----", tw.tokenize(data))

word_tokenizer----- ['This', 'is', 'a', 'tweet', 'with', '#', 'hashtag', 'and', '@', 'mention', '!', '😊']
TweetTokenizer----- ['This', 'is', 'a', 'tweet', 'with', '#hashtag', 'and', '@mention', '!', '😊']


In [4]:
nltk.download('stopwords')
nltk.download('twitter_samples')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/toddwalters/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/toddwalters/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [5]:
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [18]:
data = twitter_samples.strings('positive_tweets.json')
data

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days',
 '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM',
 "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI",
 '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.',
 'Jgh , but we have to go to Bayan :D bye',
 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing 

##### <a id = 'f3'>
<font size = 10 color = 'midnightblue'> **Load English and French Embeddings**

In [51]:
with open(f"{SUPPORTING_FILES_PATH}/en_embeddings.p", 'rb') as f:
    en_emb_subset = pickle.load(f)
with open(f"{SUPPORTING_FILES_PATH}/fr_embeddings.p", 'rb') as f:
    fr_emb_subset = pickle.load(f)

In [52]:
file =  pd.read_csv(f'{DATASET_PATH}/en-fr.train.txt', delimiter = ' ', header =None, index_col = [0]).squeeze('columns')
eng_to_fr_dict_train =  file.to_dict()

# eng_to_fr_dict_train

In [53]:
len(eng_to_fr_dict_train)

5000

In [54]:
file2 =  pd.read_csv(f'{DATASET_PATH}/en-fr.test.txt', delimiter = ' ', header =None, index_col = [0]).squeeze('columns')
eng_to_fr_dict_test =  file2.to_dict()

# eng_to_fr_dict_test

In [55]:
len(en_emb_subset)

6370

[top](#f0)

##### <a id= 'f4'>
<font size = 10 color = 'midnightblue'> **Translating English Dictionary to French** <br>


##### <a id = 'f4.1'>
<font size = 6 color = 'pwdrblue'> <b>Working with embeddings

<div class="alert alert-block alert-success">
<font size = 4>
    
- Generate a matrix $X$ whose rows are the English embeddings.
- Generate a matrix $Y$ whose rows are the corresponding French embeddings.
- Find the projection matrix $R$ that minimizes the squared Frobenius norm $\|XR - Y\|_F^2$.

> - The goal is to find a transformation matrix $R$ that makes the projected English embeddings $XR$ as close as possible to the French embeddings $Y$.
> - The Frobenius norm measures the "size" of a matrix: it is the square root of the sum of its squared entries.
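
The alignment step described above can be sketched on toy data. This is only an illustration: the three-dimensional vectors, the dictionaries, and the helper name `build_aligned_matrices` are made up for the example and are not part of the notebook's data.

```python
import numpy as np

# Hypothetical toy stand-ins for the real embedding dictionaries.
en_emb = {"cat": np.array([1.0, 0.0, 0.0]), "dog": np.array([0.0, 1.0, 0.0])}
fr_emb = {"chat": np.array([0.0, 1.0, 0.0]), "chien": np.array([1.0, 0.0, 0.0])}
# "bird" has no embedding, so this pair should be skipped.
en_to_fr = {"cat": "chat", "dog": "chien", "bird": "oiseau"}

def build_aligned_matrices(pairs, src_emb, tgt_emb):
    """Stack embeddings row-wise for every pair found in both vocabularies."""
    kept = [(s, t) for s, t in pairs.items() if s in src_emb and t in tgt_emb]
    X = np.vstack([src_emb[s] for s, _ in kept])
    Y = np.vstack([tgt_emb[t] for _, t in kept])
    return X, Y

X_toy, Y_toy = build_aligned_matrices(en_to_fr, en_emb, fr_emb)
print(X_toy.shape, Y_toy.shape)
```

Row `i` of `X_toy` and row `i` of `Y_toy` belong to the same dictionary entry, which is what lets a single matrix `R` be fitted across all pairs.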

In [56]:
# get the set of words of English

eng_words = en_emb_subset.keys()
fr_words = fr_emb_subset.keys()

<font size = 5 color = 'seagreen'> <b>Check that both the English and the French word of each dictionary entry have an embedding

In [60]:
eng_emb =[]
frnch_emb = []

for eng, fr in eng_to_fr_dict_train.items():
    if (eng in eng_words) and (fr in fr_words):
        # keep the pair only when both embeddings exist
        eng_emb.append(en_emb_subset[eng])
        frnch_emb.append(fr_emb_subset[fr])

print(f'The number of English words that have a corresponding French word is: {len(eng_emb)}')
print(f'The number of French words that have a corresponding English word is: {len(frnch_emb)}')

The number of English words that have a corresponding French word is: 4932
The number of French words that have a corresponding English word is: 4932


<font size = 5 color = 'seagreen'> <b>Create the English and French Embedding Matrices

In [33]:
X = np.vstack(eng_emb)
X.shape

(4932, 300)

In [34]:
Y = np.vstack(frnch_emb)
Y.shape

(4932, 300)

<font size = 5 color = 'seagreen'> <b>Translation

<div class="alert alert-block alert-success">
<font size = 4>
    
The loss function is the squared Frobenius norm of the difference between the projected
English embeddings $XR$ and the French embeddings $Y$, divided by the number of training examples $m$.
</div>

<font size = 5>
$$ L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n}\left( a_{i j} \right)^{2}$$


<font size = 4>
    
<center> where $a_{i j}$ is the value in the $i$th row and $j$th column of the matrix $\mathbf{XR}-\mathbf{Y}$, and $n$ is the embedding dimension.
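
As a small worked example of this loss, the sketch below evaluates it on 2x2 matrices where the answer can be checked by hand. The function name `compute_loss` and the `_demo` arrays are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

def compute_loss(X, Y, R):
    """Squared Frobenius norm of XR - Y, divided by the number of rows m."""
    m = X.shape[0]
    diff = X @ R - Y
    return np.sum(diff ** 2) / m

X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_demo = np.array([[2.0, 0.0], [0.0, 2.0]])
R_demo = np.eye(2)

# XR - Y is minus the identity: two squared entries equal to 1, divided by m = 2.
loss_demo = compute_loss(X_demo, Y_demo, R_demo)
print(loss_demo)
```

With `R = 2 * np.eye(2)` the projection matches `Y_demo` exactly and the loss drops to zero.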

##### <a id = 'f4.2'>
<font size = 6 color = 'pwdrblue'> <b>Computing the gradient of the loss with respect to the transform matrix R

<div class="alert alert-block alert-success">
<font size = 4>
    
* Calculate the gradient of the loss with respect to the transform matrix `R`.
* The gradient is a matrix that encodes how much a small change in each entry of `R`
changes the loss.
* The gradient points in the direction of steepest increase of the loss, so each update moves `R`
a small step in the opposite direction to reduce the loss.
* $m$ is the number of training examples (number of rows in $X$).
* The formula for the gradient of the loss function $L(X, Y, R)$ is:

$$\frac{d}{dR}L(X, Y, R)=\frac{d}{dR}\Big(\frac{1}{m}\| X R -Y\|_{F}^{2}\Big) = \frac{2}{m}X^{T} (X R - Y)$$
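
One way to gain confidence in this formula is to compare it with a finite-difference estimate on small random matrices. The sketch below assumes nothing from the notebook: `compute_loss`, `compute_gradient`, and the `_chk` arrays are hypothetical names used only for this check.

```python
import numpy as np

def compute_loss(X, Y, R):
    return np.sum((X @ R - Y) ** 2) / X.shape[0]

def compute_gradient(X, Y, R):
    """Analytic gradient (2/m) * X^T (XR - Y)."""
    m = X.shape[0]
    return (2.0 / m) * X.T @ (X @ R - Y)

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(5, 3))
Y_chk = rng.normal(size=(5, 3))
R_chk = rng.normal(size=(3, 3))

analytic = compute_gradient(X_chk, Y_chk, R_chk)

# Central differences: perturb one entry of R at a time.
numeric = np.zeros_like(R_chk)
eps = 1e-6
for i in range(R_chk.shape[0]):
    for j in range(R_chk.shape[1]):
        step = np.zeros_like(R_chk)
        step[i, j] = eps
        numeric[i, j] = (compute_loss(X_chk, Y_chk, R_chk + step)
                         - compute_loss(X_chk, Y_chk, R_chk - step)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

Because the loss is quadratic in `R`, the two estimates should agree up to floating-point rounding.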



[top](#f0)

In [35]:
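R = np.random.rand(X.shape[1], X.shape[1])  # random initial transform, shape (300, 300)
train_steps = 600
learning_rate = 0.8
m = X.shape[0]  # number of training pairs

for i in range(train_steps + 1):
    diff = X @ R - Y
    if i % 2 == 0:
        # report the loss every second iteration
        loss = np.sum(diff ** 2) / m
        print(f"loss at iteration {i} is : {loss:.3f}")
    # gradient of the loss with respect to R: (2/m) X^T (XR - Y)
    gradient = (2 / m) * X.T @ diff
    # step against the gradient to reduce the loss
    R = R - learning_rate * gradient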

loss at iteration 0 is : 954.520
loss at iteration 2 is : 598.257
loss at iteration 4 is : 473.639
loss at iteration 6 is : 387.056
loss at iteration 8 is : 322.781
loss at iteration 10 is : 272.972
loss at iteration 12 is : 233.240
loss at iteration 14 is : 200.903
loss at iteration 16 is : 174.198
loss at iteration 18 is : 151.896
loss at iteration 20 is : 133.103
loss at iteration 22 is : 117.147
loss at iteration 24 is : 103.513
loss at iteration 26 is : 91.796
loss at iteration 28 is : 81.676
loss at iteration 30 is : 72.895
loss at iteration 32 is : 65.245
loss at iteration 34 is : 58.555
loss at iteration 36 is : 52.684
loss at iteration 38 is : 47.515
loss at iteration 40 is : 42.949
loss at iteration 42 is : 38.907
loss at iteration 44 is : 35.317
loss at iteration 46 is : 32.121
loss at iteration 48 is : 29.270
loss at iteration 50 is : 26.720
loss at iteration 52 is : 24.436
loss at iteration 54 is : 22.384
loss at iteration 56 is : 20.539
loss at iteration 58 is : 18.875
...

In [36]:
R

array([[-0.00964293, -0.01166075,  0.00155903, ...,  0.00328293,
        -0.00266761,  0.00306729],
       [ 0.0170588 , -0.0001112 ,  0.00383361, ...,  0.02372725,
        -0.00332799, -0.00051538],
       [ 0.00410123, -0.000701  ,  0.01387412, ..., -0.00425814,
        -0.00783593,  0.00342477],
       ...,
       [-0.00212901,  0.00471336, -0.01954335, ..., -0.01114275,
         0.00611266,  0.00228909],
       [ 0.00316146, -0.00672992, -0.00153853, ..., -0.00672607,
        -0.00257333, -0.00158081],
       [ 0.00675236, -0.01235511, -0.00269738, ..., -0.01258623,
         0.0092575 , -0.00419386]])
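
Since the loss is an ordinary least-squares objective, a gradient-descent result can be sanity-checked against a closed-form solver. The sketch below swaps in `np.linalg.lstsq` for that purpose and runs on synthetic data; the `_ls` arrays, the step size of 0.1, and the 2000 iterations are assumptions chosen for the toy problem, not the notebook's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X_ls = rng.normal(size=(50, 4))
R_true = rng.normal(size=(4, 4))
Y_ls = X_ls @ R_true  # noiseless targets, so the best R is R_true itself

# Closed-form least-squares solution of X R = Y.
R_closed, *_ = np.linalg.lstsq(X_ls, Y_ls, rcond=None)

# Plain gradient descent on the same objective.
R_gd = np.zeros((4, 4))
m = X_ls.shape[0]
for _ in range(2000):
    R_gd -= 0.1 * (2.0 / m) * X_ls.T @ (X_ls @ R_gd - Y_ls)

gap = np.max(np.abs(R_gd - R_closed))
print(gap)
```

On this small, well-conditioned problem both routes should land on essentially the same matrix. On the real 300-dimensional embeddings the gap after a fixed number of steps depends on the conditioning of `X`.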

[top](#f0)

In [37]:
pred = np.dot(X,R)

In [38]:
pred

array([[-2.45457470e-02, -1.41282401e-03, -2.53929813e-02, ...,
         2.85260415e-02,  1.66965013e-02, -4.81955792e-03],
       [-1.45515094e-02, -4.40662761e-04, -5.47825657e-02, ...,
         1.73789195e-02,  4.54643645e-02,  6.28937727e-03],
       [-1.52669538e-02,  1.56740273e-02, -1.44833029e-02, ...,
         2.61544813e-02,  4.79099637e-02, -1.69485327e-02],
       ...,
       [ 2.03949800e-02, -5.34700080e-02, -3.73369759e-02, ...,
        -1.48192593e-02,  3.57140811e-02, -2.90826455e-02],
       [-1.54374222e-02, -7.08870860e-03,  3.05189906e-02, ...,
         6.81535670e-02,  9.09777020e-05, -3.05701405e-02],
       [-2.60339903e-02, -1.87289677e-02, -2.07937547e-02, ...,
         4.34216770e-02,  5.73195082e-02,  1.98407427e-02]])

In [39]:
def cosine_similarity(vec1, vec2):
    # Note: this definition replaces the cosine_similarity imported from sklearn above.
    return np.dot(vec1, vec2)/(np.linalg.norm(vec1)*np.linalg.norm(vec2))

In [40]:
def nearest_neighbor(v, candidates, k=1):
    # Indices of the k candidates most similar to v, ordered from least to most similar.
    return np.argsort([cosine_similarity(v,row) for row in candidates])[-k:]

In [41]:
def test_vocab(X,Y,R):
    # Accuracy: fraction of rows whose projection X[i] @ R has Y[i] as its nearest neighbor.
    pred = np.dot(X,R)
    return sum([nearest_neighbor(row,Y)==index for index, row in enumerate(pred)])/len(pred)

In [42]:
test_vocab(X,Y,R)

array([0.5553528])
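
The accuracy above is computed with one Python-level similarity call per candidate row, which can be slow for thousands of words. A vectorized variant is sketched below under two assumptions: no embedding is the zero vector, and ties are rare (`np.argmax` returns the first maximum, whereas the `argsort`-based version returns the last). The function name and the `_acc` arrays are hypothetical.

```python
import numpy as np

def nearest_neighbor_accuracy(X, Y, R):
    """Fraction of rows i whose projection X[i] @ R is closest (by cosine) to Y[i]."""
    pred = X @ R
    # Normalize rows so that a dot product equals cosine similarity.
    pred_unit = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    Y_unit = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sims = pred_unit @ Y_unit.T  # (m, m) matrix of cosine similarities
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(X))))

X_acc = np.eye(3)
Y_acc = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
R_perfect = Y_acc.copy()  # X_acc is the identity, so X_acc @ R_perfect equals Y_acc

acc_perfect = nearest_neighbor_accuracy(X_acc, Y_acc, R_perfect)
acc_identity = nearest_neighbor_accuracy(X_acc, Y_acc, np.eye(3))
print(acc_perfect, acc_identity)
```

The similarity matrix has one entry per pair of rows, so for very large vocabularies it may need to be computed in batches.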

In [62]:
english_words = ['cat', 'dog', 'france', 'germany', 'king']

In [67]:
sample_eng_wembd = []
for word in english_words:
    if word in en_emb_subset:
        sample_eng_wembd.append(en_emb_subset[word])
    else:
        print(f"{word} not in en_emb_subset")

In [68]:
sample_eng_wembd[0].shape

(300,)

In [70]:
example_eng_wembd = np.vstack(sample_eng_wembd)

In [71]:
example_eng_wembd.shape

(5, 300)

In [73]:
pred_fr = np.dot(example_eng_wembd, R)

In [74]:
pred_fr_words = []
for vec in pred_fr:
    nearest_idx = nearest_neighbor(vec, list(fr_emb_subset.values()))
    nearest_word = list(fr_emb_subset.keys())[nearest_idx[0]]
    pred_fr_words.append(nearest_word)

In [75]:
for eng, fr in zip(english_words, pred_fr_words):
    print(f"{eng} is {fr}")

cat is chat
dog is chienne
france is angleterre
germany is allemagne
king is souverain


In [88]:
def translate_word(word, en_emb_subset, fr_emb_subset, R):
    if word in en_emb_subset:
        eng_vec = en_emb_subset[word]
        # project the English vector into the French embedding space
        pred_vec = np.dot(eng_vec, R)
        fr_words = list(fr_emb_subset.keys())
        fr_vec = np.vstack([fr_emb_subset[fr] for fr in fr_words])
        nearest_idx = nearest_neighbor(pred_vec, fr_vec)[0]
        return fr_words[nearest_idx]
    else:
        # words without an English embedding are returned unchanged
        return word

In [89]:
def translate_sentence(sentence, en_emb_subset, fr_emb_subset, R):
    words = word_tokenize(sentence.lower())
    result = [translate_word(w1, en_emb_subset, fr_emb_subset, R) for w1 in words]
    return " ".join(result)

In [90]:
english_phrases = ['cat is playing with the dog', 'I would like to have some food to eat', 'I am going to the market to buy some fruits']

In [93]:
for sentence in english_phrases:
    translated_sentence = translate_sentence(sentence, en_emb_subset, fr_emb_subset, R)
    print('----------')
    print(f"{sentence}")
    print('---------->')
    print(translated_sentence)

----------
cat is playing with the dog
---------->
chat is jouant mais même chienne
----------
I would like to have some food to eat
---------->
i pourrait évidemment to mais mais aliments to manger
----------
I am going to the market to buy some fruits
---------->
i am essayer to même marché to vendre mais fruits
