<a href="https://colab.research.google.com/github/thedatadj/natural-language-processing/blob/main/machine_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Program that translates English to French.

# Embeddings
Load vector representation for english and french words.

In [28]:
# Download files
!gdown 16TjH95jhGzqA8f0zSf0uOrAYM_E0aYtT
!gdown 1NrhEHDYmgSslUwi292S0x5isy0Av6H9C

Downloading...
From: https://drive.google.com/uc?id=16TjH95jhGzqA8f0zSf0uOrAYM_E0aYtT
To: /content/en_embeddings.p
100% 8.12M/8.12M [00:00<00:00, 106MB/s]
Downloading...
From: https://drive.google.com/uc?id=1NrhEHDYmgSslUwi292S0x5isy0Av6H9C
To: /content/fr_embeddings.p
100% 7.36M/7.36M [00:00<00:00, 92.4MB/s]


In [2]:
import pickle

Load dictionary containing words and their 300-dimensional vector representation.

In [3]:
en_embeddings_subset = pickle.load(open("/content/en_embeddings.p", "rb"))
fr_embeddings_subset = pickle.load(open("/content/fr_embeddings.p", "rb"))

First 10 components of vector representation of the word "the".

In [32]:
en_embeddings_subset['the'][:10]

array([ 0.08007812,  0.10498047,  0.04980469,  0.0534668 , -0.06738281,
       -0.12060547,  0.03515625, -0.11865234,  0.04394531,  0.03015137],
      dtype=float32)

# English to French word dictionary
Load a dictionary containing the translation of the words from english to french.

In [33]:
# Download file
!gdown 12-kA2qWDbMGO7yWZnMS6zed0E26QOJD2
!gdown 1A8F6fdFTPc9VGMa2omX_6BVxKe6IejVh

Downloading...
From: https://drive.google.com/uc?id=12-kA2qWDbMGO7yWZnMS6zed0E26QOJD2
To: /content/en-fr.test.txt
100% 50.5k/50.5k [00:00<00:00, 65.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1A8F6fdFTPc9VGMa2omX_6BVxKe6IejVh
To: /content/en-fr.train.txt
100% 179k/179k [00:00<00:00, 97.8MB/s]


In [34]:
import pandas as pd

In [35]:
file1 = pd.read_csv("/content/en-fr.train.txt", delimiter=' ')

file1.head()

Unnamed: 0,the,le
0,the,les
1,the,la
2,and,et
3,was,fut
4,was,etait


In [40]:
# Dictionary for training
enfr_train = {}
for i in range(len(file1)):
    enword = file1.loc[i][0]
    frword = file1.loc[i][1]
    enfr_train[enword] = frword

French translation of the word "the".

In [42]:
enfr_train['the']

'la'

# Training set
Prepare the training data by getting:
* A feature matrix `X_train` containing the word embeddings of english words.
* A target matrix `Y_train` containing the word embeddings of french words.

In [43]:
# English and French vocabularies
envocab = set(en_embeddings_subset.keys())
frvocab = set(fr_embeddings_subset.keys())

French vocabulary that has an english translation.

In [44]:
frvocaben = set(enfr_train.values())

In [45]:
import numpy as np

In [46]:
xlist = []
ylist = []
for enword, frword in enfr_train.items():
    if frword in frvocab and enword in envocab:
        # Get embeddings
        env = en_embeddings_subset[enword]
        frv = fr_embeddings_subset[frword]

        # Store embeddings in a list
        xlist.append(env)
        ylist.append(frv)

# Create feature and targe matrices
X_train = np.array(xlist)
Y_train = np.array(ylist)

# Training algorithm
Train a model to predict the french translation of an english word.

Loss function

$$ L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n}\left( a_{i j} \right)^{2}$$

In [47]:
def L(X, Y, R):
    m = X.shape[0]
    a = X.dot(R) - Y
    loss = np.sum(np.square(a))/m

    return loss

Gradient

$$\frac{d}{dR}𝐿(𝑋,𝑌,𝑅) = \frac{2}{m}X^{T} (X R - Y)$$

In [48]:
def derivative(X, Y, R):
    m = X.shape[0]
    a = X.dot(R) - Y
    gradient = (2/m)*(X.T).dot(a)

    return gradient

Gradient descent
$$R = R - \alpha \frac{dL}{dR}$$

In [51]:
# Hyperparameters
train_steps = 400
verbose = True
learning_rate = 0.8

In [52]:
# Initialize R
np.random.seed(129)
R = np.random.rand(X_train.shape[1], X_train.shape[1])

# Training loop
for i in range(train_steps):
    if verbose and i % 25 == 0:
        print(f"Loos at iteration no. {i} is {L(X_train, Y_train, R):.4f}")

    gradient = derivative(X_train, Y_train, R)

    R -= learning_rate*gradient

Loos at iteration no. 0 is 963.0146
Loos at iteration no. 25 is 97.8292
Loos at iteration no. 50 is 26.8329
Loos at iteration no. 75 is 9.7893
Loos at iteration no. 100 is 4.3776
Loos at iteration no. 125 is 2.3281
Loos at iteration no. 150 is 1.4480
Loos at iteration no. 175 is 1.0338
Loos at iteration no. 200 is 0.8251
Loos at iteration no. 225 is 0.7145
Loos at iteration no. 250 is 0.6534
Loos at iteration no. 275 is 0.6185
Loos at iteration no. 300 is 0.5981
Loos at iteration no. 325 is 0.5858
Loos at iteration no. 350 is 0.5782
Loos at iteration no. 375 is 0.5735


# Testing
Evaluate the accuracy of the model on an unseen dataset.

* To compare similar vectors we'll use cosine similarity.
* To find an approximation for the output we will look for the nearest neighbor.


In [54]:
def cos(v, u):
    dot = v.dot(u)
    normv = np.linalg.norm(v)
    normu = np.linalg.norm(u)
    cos = dot / (normv * normu)
    return cos

In [53]:
def nearest_neighbor(vector, candidates, k=1):
    similarityrange = []
    for domainvector in candidates:
        similarityvalue = cos(vector, domainvector)
        similarityrange.append(similarityvalue)
    # Get the indices of the top k most similar vectors
    sortedindices = np.argsort(similarityrange)
    sortedindices = np.flip(sortedindices)
    topk = sortedindices[:k]
    return topk

Determine accurary

In [55]:
def accuracy(X, Y, R):
    pred = X.dot(R)

    correct = 0
    for i in range(len(pred)):
        predindex = nearest_neighbor(pred[i], Y, k=1)

        if predindex == i:
            correct += 1

    accuracy = correct/X.shape[0]
    return accuracy

Load validation set

In [56]:
file2 = pd.read_csv("/content/en-fr.test.txt", delimiter=" ")
file2.head()

Unnamed: 0,torpedo,torpille
0,torpedo,torpilles
1,giovanni,giovanni
2,chat,discuter
3,chat,discussion
4,chat,causerie


In [57]:
# Dictionary for validation
enfr_valid = {}
for i in range(len(file2)):
    enword = file2.loc[i][0]
    frword = file2.loc[i][0]
    enfr_valid[enword] = frword

In [58]:
xlist = []
ylist = []
for enword, frword in enfr_train.items():
    if frword in frvocab and enword in envocab:
        # Get embeddings
        env = en_embeddings_subset[enword]
        frv = fr_embeddings_subset[frword]

        # Store embeddings in a list
        xlist.append(env)
        ylist.append(frv)

# Create feature and targe matrices
X_valid = np.array(xlist)
Y_valid = np.array(ylist)

In [59]:
# Validation accuracy
accuracy(X_valid, Y_valid, R)

0.5523114355231143

<table>
    <tr>
        <td>
            Based on
        </td>
        <td>
            Assignment from the Natural Language Processing Specialization in coursera.
        </td>
    </tr>
</table>