<a href="https://colab.research.google.com/github/sujitpal/nlp-deeplearning-ai-examples/blob/master/01_04_word_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# English to Spanish Word Translation

The idea is to learn a transformation (as the weights of a neural network) from the embedding of an English word to the embedding of the equivalent Spanish word. We use the [FreeDict site](https://freedict.org/downloads/#dictionary-downloads) to get the set of parallel words.

(Note: the downloaded `eng-spa.dict.dz` file is plain text and is not convenient for machine reading. It is better to download [the TEI (XML) file from GitHub](https://github.com/freedict/fd-dictionaries/blob/master/eng-spa/eng-spa.tei) and parse it as shown here.)
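
For orientation, the parsing code later in this notebook reads the headword from an `orth` element and the translation from a `quote` element inside each `entry`. The sketch below shows that shape on a hand-written sample. The XML snippet is an illustrative assumption, not text copied from `eng-spa.tei`, and the hypothetical helper `extract_pairs` uses the standard-library `xml.etree.ElementTree` rather than the BeautifulSoup parser used in the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of a FreeDict TEI entry (assumed layout).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry>
      <form><orth>milk</orth></form>
      <sense><cit type="trans"><quote>leche</quote></cit></sense>
    </entry>
    <entry>
      <form><orth>knife</orth></form>
      <sense><cit type="trans"><quote>cuchillo</quote></cit></sense>
    </entry>
  </body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_pairs(tei_text):
    """Return (headword, first translation) pairs from a TEI string."""
    root = ET.fromstring(tei_text)
    pairs = []
    for entry in root.iterfind(".//tei:entry", NS):
        orth = entry.find(".//tei:orth", NS)
        quote = entry.find(".//tei:quote", NS)
        # Skip entries that lack either side of the pair.
        if orth is None or quote is None:
            continue
        pairs.append((orth.text.strip(), quote.text.strip()))
    return pairs


pairs = extract_pairs(SAMPLE_TEI)
print(pairs)
```

Like the notebook's parser, this keeps only the first translation of each entry, so words with several senses contribute a single pair.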

To generate the embeddings, we use fastText, which provides pretrained English and Spanish embeddings.

We then train a neural network that takes the English embedding of a word as input and outputs the corresponding Spanish embedding.
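
In symbols, reading ahead to the network and loss defined later in this notebook (one hidden ReLU layer trained with mean squared error), the model and its objective can be sketched as follows. Here $x_i$ and $y_i$ denote the English and Spanish embeddings of the $i$-th word pair, $d = 300$ is the embedding dimension, and $N$ is the number of training pairs; the notation is introduced here for illustration only.

$$
f_\theta(x) = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2,
\qquad
\mathcal{L}(\theta) = \frac{1}{N d} \sum_{i=1}^{N} \lVert f_\theta(x_i) - y_i \rVert_2^2
$$

In practice the loss is computed per mini-batch, so $N$ is replaced by the batch size at each optimization step.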

We also build a KD-Tree over the Spanish embeddings of the test set, so we can find the nearest neighbors of the predictions.

During prediction, we feed in the embedding of the English word and check whether the correct Spanish word is among the top 3 nearest neighbors of the predicted embedding.
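
The top-3 check can be illustrated on toy data. The following is a minimal sketch, not the notebook's evaluation code: the helper name `top_k_hits` and the 2-D points are assumptions made for readability, and it queries all predictions in one batched call instead of one query per prediction.

```python
import numpy as np
from sklearn.neighbors import KDTree


def top_k_hits(predicted, targets, k=3):
    """Count predictions whose true target is among their k nearest targets."""
    tree = KDTree(targets, leaf_size=30, metric="euclidean")
    # Query all predictions at once; row i holds the indices of the k nearest targets.
    neighbors = tree.query(predicted, k=k, return_distance=False)
    row_ids = np.arange(len(predicted)).reshape(-1, 1)
    return int((neighbors == row_ids).any(axis=1).sum())


# Toy stand-ins for embeddings (assumed data, 2-D for readability).
targets = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 5.0]])
predicted = targets + 0.1      # near-perfect predictions
predicted[0] = [10.0, 10.2]    # one prediction that lands far from its own target

hits = top_k_hits(predicted, targets, k=3)
print(hits, "of", len(predicted))
```

Prediction `i` counts as correct when index `i` appears among its `k` nearest targets, which is the same criterion the evaluation loop at the end of the notebook applies with `k=3`.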

In [2]:
%pip install fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3015765 sha256=16dcc59e94c4ba1b839962e9f391aea61ee48fd28466bcfb28db67e1278121d8
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee9

In [3]:
import fasttext
import fasttext.util
import numpy as np
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KDTree
from torch.utils.data import TensorDataset, DataLoader

In [4]:
# Mount Google Drive
from google.colab import drive # import drive from google colab

ROOT = "/content/drive"     # default location for the drive
print(ROOT)                 # print content of ROOT (Optional)

drive.mount(ROOT)           # we mount the google drive at /content/drive

/content/drive
Mounted at /content/drive


In [5]:
%ls "drive/My Drive/nlp-deeplearning-ai-data"

cc.en.300.bin   eng-spa.tei     testdata.manual.2009.06.14.csv
cc.es.300.bin   eng-spa.tsv     training.1600000.processed.noemoticon.csv
eng-embeds.txt  spa_embeds.txt


In [6]:
DATA_DIR = "drive/My Drive/nlp-deeplearning-ai-data"

## Dataset Creation

We get a parallel English-Spanish word list from FreeDict, then generate embeddings for each side using the fastText English and Spanish models, respectively.

In [7]:
def maybe_parse_xmldict(tei_file, tsv_file):
  # Parse the TEI dictionary once; skip if the TSV already exists.
  if not os.path.exists(tsv_file):
    with open(tei_file, encoding="utf-8") as f:
      contents = f.read()
    num_pairs = 0
    soup = BeautifulSoup(contents, "xml")
    with open(tsv_file, "w", encoding="utf-8") as ftsv:
      for entry in soup.find_all("entry"):
        if num_pairs % 1000 == 0:
          print("{:d} pairs extracted".format(num_pairs))
        # Keep the headword and its first listed translation only.
        eng = entry.find_all("orth")[0].get_text()
        spa = entry.find_all("quote")[0].get_text()
        ftsv.write("{:s}\t{:s}\n".format(eng, spa))
        num_pairs += 1
    print("{:d} pairs extracted, COMPLETE".format(num_pairs))


maybe_parse_xmldict(os.path.join(DATA_DIR, "eng-spa.tei"),
                    os.path.join(DATA_DIR, "eng-spa.tsv"))

In [8]:
words_df = pd.read_csv(os.path.join(DATA_DIR, "eng-spa.tsv"), sep='\t', names=["eng", "spa"])
words_df.head()

        eng        spa
0  zucchini  calabacín
1       ABC      abecé
2      Adam       Adán
3    Africa     África
4   African   africano


In [9]:
def maybe_generate_embeddings(words, lang, model_name, embed_file):
  # Write one "word<TAB>space-separated vector" line per word; skip if the file exists.
  if not os.path.exists(embed_file):
    # Fetch the pretrained model unless it is already present.
    fasttext.util.download_model(lang, if_exists="ignore")
    embeddings = fasttext.load_model(model_name)
    with open(embed_file, "w", encoding="utf-8") as femb:
      for word in words:
        vec = embeddings[word]
        vec_str = " ".join(["{:.5f}".format(v) for v in vec])
        femb.write("{:s}\t{:s}\n".format(word, vec_str))

maybe_generate_embeddings(words_df["eng"].values, "en", "cc.en.300.bin",
                          os.path.join(DATA_DIR, "eng-embeds.txt"))
maybe_generate_embeddings(words_df["spa"].values, "es", "cc.es.300.bin",
                          os.path.join(DATA_DIR, "spa_embeds.txt"))

## Split into Train, Validation, and Test

In [11]:
def extract_vectors(eng_file, spa_file):
  eng_words, eng_vecs, spa_words, spa_vecs = [], [], [], []
  with open(eng_file, "r") as feng:
    for line in feng:
      word, vec = line.strip().split('\t')
      eng_words.append(word)
      eng_vecs.append([float(v) for v in vec.split()])
  with open(spa_file, "r") as fspa:
    for line in fspa:
      word, vec = line.strip().split('\t')
      spa_words.append(word)
      spa_vecs.append([float(v) for v in vec.split()])
  return eng_words, np.array(eng_vecs), spa_words, np.array(spa_vecs)

eng_words, X, spa_words, Y = extract_vectors(
    os.path.join(DATA_DIR, "eng-embeds.txt"),
    os.path.join(DATA_DIR, "spa_embeds.txt"))

Xtv, Xtest, Ytv, Ytest, _, etest, _, stest = train_test_split(X, Y, eng_words, spa_words, test_size=0.3)
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtv, Ytv, test_size=0.1)
Xtrain.shape, Ytrain.shape, Xval.shape, Yval.shape, Xtest.shape, Ytest.shape

((3720, 300), (3720, 300), (414, 300), (414, 300), (1773, 300), (1773, 300))
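
The shapes above imply 5,907 word pairs in total (3720 + 414 + 1773). As a sanity check on where those numbers come from, the sketch below repeats the two-stage split on dummy arrays. The `random_state` values are added here only to make the sketch repeatable; the notebook's own split is unseeded, so its row assignment differs from run to run while the sizes stay the same.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_pairs = 3720 + 414 + 1773   # total implied by the shapes above
X = np.zeros((n_pairs, 1))    # dummy features; only the row counts matter here
Y = np.zeros((n_pairs, 1))

# Stage 1: hold out 30% for testing (scikit-learn rounds the test count up).
Xtv, Xtest, Ytv, Ytest = train_test_split(X, Y, test_size=0.3, random_state=0)
# Stage 2: hold out 10% of the remainder for validation.
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtv, Ytv, test_size=0.1, random_state=0)

sizes = (len(Xtrain), len(Xval), len(Xtest))
print(n_pairs, sizes)
```

Because the validation fraction is taken from the 70% that remains after the test split, the effective proportions are roughly 63% train, 7% validation, and 30% test.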

## Torch Stuff -- Dataset, DataLoader, and Network

In [12]:
def make_dataset(X, Y):
  xt = torch.tensor(X, dtype=torch.float32)
  yt = torch.tensor(Y, dtype=torch.float32)
  return TensorDataset(xt, yt)

train_ds = make_dataset(Xtrain, Ytrain)
val_ds = make_dataset(Xval, Yval)
test_ds = make_dataset(Xtest, Ytest)

In [13]:
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)
val_dl = DataLoader(val_ds, batch_size=32, shuffle=False, num_workers=4)
test_dl = DataLoader(test_ds, batch_size=32, shuffle=False, num_workers=4)

In [14]:
class WordTranslationNet(torch.nn.Module):
  def __init__(self, input_size, hidden_size, output_size):
    super().__init__()
    self.encoder = nn.Linear(input_size, hidden_size)
    self.decoder = nn.Linear(hidden_size, output_size)

  def forward(self, x):
    x = self.encoder(x)
    x = F.relu(x)
    x = self.decoder(x)
    return x

net = WordTranslationNet(300, 50, 300)
net

WordTranslationNet(
  (encoder): Linear(in_features=300, out_features=50, bias=True)
  (decoder): Linear(in_features=50, out_features=300, bias=True)
)

## Train and Eval loops

In [15]:
def train(net, dev, train_dl, val_dl, num_epochs=20, lr=1e-4):
  params = filter(lambda p: p.requires_grad, net.parameters())
  optimizer = torch.optim.Adam(params, lr=lr)
  for i in range(num_epochs):
    net.train()
    sum_loss, total = 0, 0
    for x, y in train_dl:
      x, y = x.to(dev), y.to(dev)
      y_ = net(x)
      optimizer.zero_grad()
      loss = F.mse_loss(y_, y)
      loss.backward()
      optimizer.step()
      sum_loss += loss.item() * y.shape[0]
      total += y.shape[0]
    val_loss = evaluate(net, dev, val_dl)
    print("EPOCH {:d}: train loss: {:.3f}, val loss: {:.3f}"
      .format(i, sum_loss / total, val_loss))


def evaluate(net, dev, val_dl):
  net.eval()
  total, sum_loss = 0, 0
  # No gradients are needed for evaluation.
  with torch.no_grad():
    for x, y in val_dl:
      x, y = x.to(dev), y.to(dev)
      y_ = net(x)
      loss = F.mse_loss(y_, y)
      total += y.shape[0]
      sum_loss += loss.item() * y.shape[0]
  return sum_loss / total

## Train

In [16]:
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(dev)

WordTranslationNet(
  (encoder): Linear(in_features=300, out_features=50, bias=True)
  (decoder): Linear(in_features=50, out_features=300, bias=True)
)

In [17]:
train(net, dev, train_dl, val_dl, num_epochs=100)

EPOCH 0: train loss: 0.009, val loss: 0.007
EPOCH 1: train loss: 0.005, val loss: 0.004
EPOCH 2: train loss: 0.003, val loss: 0.003
EPOCH 3: train loss: 0.003, val loss: 0.003
EPOCH 4: train loss: 0.003, val loss: 0.003
EPOCH 5: train loss: 0.003, val loss: 0.003
EPOCH 6: train loss: 0.003, val loss: 0.002
EPOCH 7: train loss: 0.003, val loss: 0.002
EPOCH 8: train loss: 0.003, val loss: 0.002
EPOCH 9: train loss: 0.003, val loss: 0.002
EPOCH 10: train loss: 0.002, val loss: 0.002
EPOCH 11: train loss: 0.002, val loss: 0.002
EPOCH 12: train loss: 0.002, val loss: 0.002
EPOCH 13: train loss: 0.002, val loss: 0.002
EPOCH 14: train loss: 0.002, val loss: 0.002
EPOCH 15: train loss: 0.002, val loss: 0.002
EPOCH 16: train loss: 0.002, val loss: 0.002
EPOCH 17: train loss: 0.002, val loss: 0.002
EPOCH 18: train loss: 0.002, val loss: 0.002
EPOCH 19: train loss: 0.002, val loss: 0.002
EPOCH 20: train loss: 0.002, val loss: 0.002
EPOCH 21: train loss: 0.002, val loss: 0.002
EPOCH 22: train loss

## Generate Predictions and Evaluate

In [18]:
labels, predictions = [], []
net.eval()
with torch.no_grad():
  for xtest, ytest in test_dl:
    xtest = xtest.to(dev)
    ytest_ = net(xtest)
    # labels are not collected here; Ytest is used directly to build the KD-Tree below
    predictions.extend(ytest_.cpu().numpy())

len(labels), len(predictions)

(0, 1773)

In [19]:
kdt = KDTree(Ytest, leaf_size=30, metric='euclidean')
kdt.query(Ytest[0,:].reshape(1, -1), k=3, return_distance=False)

array([[   0, 1172, 1255]])

In [20]:
correct, incorrect = 0, 0
for i, pred in enumerate(predictions):
  neighbors = kdt.query(pred.reshape(1, -1), k=3, return_distance=False)[0]
  if i in neighbors:
    print(i, etest[i], stest[neighbors[0]], stest[neighbors[1]], stest[neighbors[2]])
    correct += 1
  else:
    incorrect += 1
print(correct, incorrect)

198 wife aristócrata esposa periodista
202 milk leche mermelada chocolate
211 fuel tank depósito de gasolina depósito de gasolina de poca profundidad
235 Greenland Escandinavia Groenlandia Mediterráneo
255 knife cuchillo camisadedeporte ensangrentado
260 destroy controlar considerar destruir
298 veal mantequilla berenjena carnedeternera
304 warm caliente caliente agradable
334 also de poca profundidad también también
335 conquer abandonar aprovechar conquistar
405 beet remolacha mantequilla chocolate
412 zoo civilización jardínzoológico billetedeidayvuelta
533 consider desentenderse de considerar desprendimiento de tierras
535 petrol tank depósito de gasolina depósito de gasolina piernas torcidas hacia afuera
626 baker panadero aristócrata periodista
647 suitable conveniente conveniente desentenderse de
680 Asian americano asiático asiático
684 sustain considerar aprovechar sostener
689 too demasiado sobretodo bastante
751 apricot albaricoquero depósito de gasolina depósito de gasolina