In [None]:
%matplotlib inline

Source: [https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what
the word *is*; it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you
combine these representations? We often want dense outputs from our
neural networks: the inputs are $|V|$-dimensional, where
$V$ is our vocabulary, but the outputs often have only a few
dimensions (if we are only predicting a handful of labels, for
instance). How do we get from a massive dimensional space to a smaller
dimensional space?

How about instead of ASCII representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.
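
As a concrete illustration, here is a minimal sketch of one-hot encoding in plain Python. The three-word vocabulary and the helper name `one_hot` are made up for this example.

```python
# Toy vocabulary (an assumption for illustration only)
vocab = ["mathematician", "physicist", "store"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-element list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec


for word in vocab:
    print(word, one_hot(word))
```

Each vector has as many entries as there are words in the vocabulary, which is why this representation grows so large for realistic vocabularies.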

There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the
[distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).
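
To make the distributional hypothesis concrete, the sketch below collects, for each word in the three training sentences above, the set of (previous word, next word) pairs it appears between. The sentences are lowercased with the final period dropped, and the helper name `contexts_of`, the boundary markers, and the one-word window on each side are assumptions made for illustration.

```python
from collections import defaultdict

sentences = [
    "the mathematician ran to the store",
    "the physicist ran to the store",
    "the mathematician solved the open problem",
]


def contexts_of(sentences):
    """Map each word to the set of (left neighbor, right neighbor) pairs seen around it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        # Hypothetical boundary markers so the first and last words also get a context
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return contexts


contexts = contexts_of(sentences)
shared = contexts["mathematician"] & contexts["physicist"]
print(shared)  # contexts in which both words were observed
```

The two words share the context between "the" and "ran", which is the kind of evidence a model could use to guess that physicist also fits between "the" and "solved".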


# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

Where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.
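
The two similarity measures above can be sketched in plain Python. This example uses only the three attribute values written out above (the entries elided by the dots are left out), so the numbers it prints are illustrative rather than a statement about the full vectors. The helper names `dot` and `cosine_similarity` are assumptions.

```python
import math

# Only the three attribute values shown above; the elided "..." entries are left out
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]


def dot(u, v):
    """Unnormalized similarity: the dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product normalized by the lengths of both vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


print(dot(q_physicist, q_mathematician))
print(cosine_similarity(q_physicist, q_mathematician))
# A vector compared with itself, and with its negation
print(cosine_similarity(q_physicist, q_physicist))
print(cosine_similarity(q_physicist, [-x for x in q_physicist]))
```

On these three dimensions the cosine similarity is positive but well below 1, because the "majored in Physics" scores have opposite signs.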


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each pair of distinct words has similarity 0, and we gave each word some
unique semantic attribute. These new vectors are *dense*, which is to say
their entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


# Word Embeddings in PyTorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in PyTorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two required arguments: the vocabulary size, and the
dimensionality of the embeddings.

To index into this table, you must use a tensor of integer indices such
as torch.LongTensor (since the indices are integers, not floats).
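
The lookup-table view can be sketched without PyTorch. In the plain-Python example below, the matrix values and the helper names are made up; it only illustrates that selecting row $i$ gives the same result as multiplying the one-hot vector for index $i$ by the $|V| \times D$ matrix.

```python
word_to_ix = {"hello": 0, "world": 1}

# A toy |V| x D embedding matrix with |V| = 2 and D = 3 (values are arbitrary)
embedding_matrix = [
    [0.1, -0.2, 0.3],
    [0.5, 0.0, -0.4],
]


def lookup(word):
    """Return the row of the embedding matrix assigned to the word."""
    return embedding_matrix[word_to_ix[word]]


def one_hot_times_matrix(word):
    """Multiply the word's one-hot row vector by the embedding matrix."""
    one_hot = [1 if i == word_to_ix[word] else 0 for i in range(len(word_to_ix))]
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot[i] * embedding_matrix[i][d] for i in range(len(one_hot)))
        for d in range(dim)
    ]


print(lookup("world"))
print(one_hot_times_matrix("world"))
```

Indexing avoids building the one-hot vector at all, which is why an embedding layer takes integer indices as input.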




In [None]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f74527d8190>

In [None]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
print(lookup_tensor)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

embeds(torch.tensor([[0,1], [1,0]], dtype=torch.long))

tensor([0])
tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


tensor([[[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519],
         [-0.1661, -1.5228,  0.3817, -1.0276, -0.5631]],

        [[-0.1661, -1.5228,  0.3817, -1.0276, -0.5631],
         [ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]]],
       grad_fn=<EmbeddingBackward0>)

# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

Where $w_i$ is the ith word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.




In [None]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)

trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
[520.363648891449, 517.6042952537537, 514.8673779964447, 512.1517074108124, 509.45520973205566, 506.77791261672974, 504.11913299560547, 501.4802691936493, 498.85717129707336, 496.24901127815247]


# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance by a
couple of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
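
As a worked instance of this objective, the sketch below evaluates $-\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)$ at the target index using plain Python lists. All numbers, the three-word vocabulary, and the helper name `cbow_neg_log_prob` are made up for illustration. It is not a solution to the exercise, only a check on the shapes involved: $A$ is $|V| \times D$ and $b$ has $|V|$ entries.

```python
import math

# Toy setup (assumed values): |V| = 3 words, D = 2 embedding dimensions
embeddings = [[0.5, -1.0], [1.5, 0.25], [-0.5, 0.75]]  # q_w, one row per word
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                # |V| x D weight matrix
b = [0.0, 0.1, -0.1]                                    # |V| bias vector


def cbow_neg_log_prob(context_ixs, target_ix):
    """Return -log p(target | context) for the linear CBOW objective."""
    dim = len(embeddings[0])
    # Sum the context embeddings: sum over w in C of q_w
    summed = [sum(embeddings[ix][d] for ix in context_ixs) for d in range(dim)]
    # Affine map to one score per vocabulary word: A * summed + b
    scores = [sum(A[v][d] * summed[d] for d in range(dim)) + b[v] for v in range(len(A))]
    # Log of the softmax normalizer, shifted by the max score for numerical stability
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[target_ix] - log_norm)


# Loss for each possible target word, given context words 0 and 2
losses = [cbow_neg_log_prob([0, 2], target) for target in range(3)]
print(losses)
```

Exponentiating the negated losses recovers a probability distribution over the vocabulary, which is a quick sanity check for any implementation of this loss.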

Implement this model in PyTorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.




In [None]:
import numpy as np
from tqdm import tqdm

In [None]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


def generate_batches(data, batch_size=8):
    # Split the (context, target) pairs into two parallel lists
    X = [context for context, _ in data]
    y = [target for _, target in data]
    # Only full batches are kept; trailing samples that do not fill a batch are dropped
    num_batches = len(data) // batch_size
    X_batched = []
    y_batched = []
    for i in range(num_batches):
        X_batched.append(X[i * batch_size:(i + 1) * batch_size])
        y_batched.append(y[i * batch_size:(i + 1) * batch_size])
    return X_batched, y_batched


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_dim=256):
      super(CBOW, self).__init__()
      # Embedding layer - Lookup table
      self.embeddings = nn.Embedding(vocab_size, embedding_dim)
      # Layer 1 - Since we'll be summing up the context vectors, the input to this layer will be embedding_dim.
      # Output is a hidden_dim-dimensional vector (256 by default)
      self.linear1 = nn.Linear(embedding_dim, hidden_dim)
      # Adding non-linearity through ReLU
      self.relu = nn.ReLU()
      # Final layer to get to vocab size dim
      self.linear2 = nn.Linear(hidden_dim, vocab_size)
      # Log softmax to get log-probabilities
      self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
      # inputs are context word indices. Get embeddings for them
      x = self.embeddings(inputs)
      # sum all context vectors
      x = torch.sum(x,axis=1).view(inputs.shape[0],-1)
      #x = sum(x).view(1,-1)
      # Add first layer
      x = self.linear1(x)
      # Add relu
      x = self.relu(x)
      # Add final layer
      x = self.linear2(x)
      # Get log softmax
      x = self.log_softmax(x)
      return x


# create your model and train.  here are some functions to help you make
# the data ready for use by your module

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

# Function to get context vectors with batched data
def make_context_vector2(context_list, word_to_ix):
    idxs = []
    for context in context_list:
      idxs.append([word_to_ix[w] for w in context])
    return torch.tensor(idxs, dtype=torch.long)

# Function to get target (label) indices with batched data
def make_labels_idx(labels, word_to_ix):
    idxs = []
    for label in labels:
      idxs.append(word_to_ix[label])
    return torch.tensor(idxs, dtype=torch.long)

# Function to train the model
def train(model, data, vocab, device, word_to_ix = word_to_ix, NUM_EPOCHS=15):
  losses = []
  loss_function = nn.NLLLoss()
  model = model.to(device)
  optimizer = optim.SGD(model.parameters(), lr=0.001)
  X_batched, y_batched = generate_batches(data, batch_size=8)

  for epoch in range(NUM_EPOCHS):
      total_loss = 0
      for context, target in tqdm(zip(X_batched, y_batched), total=len(X_batched)):

          # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
          # into integer indices and wrap them in tensors)
          context_idxs = make_context_vector2(context, word_to_ix).to(device)
          target_idxs = make_labels_idx(target, word_to_ix).to(device)

          # Step 2. Recall that torch *accumulates* gradients. Before passing in a
          # new instance, you need to zero out the gradients from the old
          # instance
          model.zero_grad()

          # Step 3. Run the forward pass, getting log probabilities over next
          # words
          log_probs = model(context_idxs)

          # Step 4. Compute your loss function. (Again, Torch wants the target
          # word wrapped in a tensor)
          loss = loss_function(log_probs, target_idxs)

          # Step 5. Do the backward pass and update the parameters
          loss.backward()
          optimizer.step()

          # Get the Python number from a 1-element Tensor by calling tensor.item()
          total_loss += loss.item()

      print("Loss Epoch {ep} = {ls}".format(ep = epoch, ls = total_loss))
      losses.append(total_loss)
  #print(losses)  # The loss decreased every iteration over the training data!

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]
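
Note that `generate_batches` above keeps only full batches, so up to `batch_size - 1` trailing samples are dropped. If you want to keep them, one option is a final, smaller batch. The sketch below is a hypothetical variant (the name `generate_batches_keep_last` is not part of the original code) shown on dummy data.

```python
def generate_batches_keep_last(data, batch_size=8):
    """Split (context, target) pairs into batches, keeping a smaller final batch."""
    X_batched, y_batched = [], []
    for start in range(0, len(data), batch_size):
        chunk = data[start:start + batch_size]
        X_batched.append([context for context, _ in chunk])
        y_batched.append([target for _, target in chunk])
    return X_batched, y_batched


# Dummy data: 10 (context, target) pairs
dummy = [(["a", "b", "c", "d"], "t%d" % i) for i in range(10)]
X_batched, y_batched = generate_batches_keep_last(dummy, batch_size=4)
print([len(batch) for batch in X_batched])  # batch sizes: 4, 4, 2
```

Because `nn.NLLLoss` averages over the batch by default, a smaller final batch still contributes a single term to `total_loss`, just like a full batch.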


In [None]:
# Define the model
model = CBOW(vocab_size = len(vocab), embedding_dim=50)

# Device for GPU Training
device = torch.device("cpu")
if torch.cuda.is_available():
   print("Training on GPU")
   device = torch.device("cuda:0")

# Train
train(model, data, vocab, device, NUM_EPOCHS=50)

Training on GPU
Loss Epoch 0 = 28.088960647583008
Loss Epoch 1 = 27.926408529281616
Loss Epoch 2 = 27.76490068435669
Loss Epoch 3 = 27.604418992996216
Loss Epoch 4 = 27.44498109817505
Loss Epoch 5 = 27.286572217941284
Loss Epoch 6 = 27.12918996810913
Loss Epoch 7 = 26.972814083099365
Loss Epoch 8 = 26.81743288040161
Loss Epoch 9 = 26.66304850578308
Loss Epoch 10 = 26.509656190872192
Loss Epoch 11 = 26.35727095603943
Loss Epoch 12 = 26.20590591430664
Loss Epoch 13 = 26.055581092834473
Loss Epoch 14 = 25.90627360343933
Loss Epoch 15 = 25.757943391799927
Loss Epoch 16 = 25.610628366470337
Loss Epoch 17 = 25.46427321434021
Loss Epoch 18 = 25.318897485733032
Loss Epoch 19 = 25.174449682235718
Loss Epoch 20 = 25.030962705612183
Loss Epoch 21 = 24.888399600982666
Loss Epoch 22 = 24.746774673461914
Loss Epoch 23 = 24.606098175048828
Loss Epoch 24 = 24.466334342956543
Loss Epoch 25 = 24.327523231506348
Loss Epoch 26 = 24.189621925354004
Loss Epoch 27 = 24.05262804031372
Loss Epoch 28 = 23.916526079177856
Loss Epoch 29 = 23.781309366226196
Loss Epoch 30 = 23.646979093551636
Loss Epoch 31 = 23.513529539108276
Loss Epoch 32 = 23.380939722061157
Loss Epoch 33 = 23.249220371246338
Loss Epoch 34 = 23.118361949920654
Loss Epoch 35 = 22.988409757614136
Loss Epoch 36 = 22.859317541122437
Loss Epoch 37 = 22.731013536453247
Loss Epoch 38 = 22.603514432907104
Loss Epoch 39 = 22.47684383392334
Loss Epoch 40 = 22.35092568397522
Loss Epoch 41 = 22.225877285003662
Loss Epoch 42 = 22.10160994529724
Loss Epoch 43 = 21.978124141693115
Loss Epoch 44 = 21.855432748794556
Loss Epoch 45 = 21.73341178894043
Loss Epoch 46 = 21.612124919891357
Loss Epoch 47 = 21.491501092910767
Loss Epoch 48 = 21.371620893478394
Loss Epoch 49 = 21.25249695777893





## Part 1 - Training CBOW for Trip Advisor and Sci-Fi Datasets

In [None]:
import pandas as pd
import nltk
nltk.download("punkt")
from nltk.tokenize import wordpunct_tokenize, sent_tokenize
import string
from tqdm import tqdm
from time import time
import numpy as np
import pickle

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
## Load data

"""
trip_advisor_url = https://drive.google.com/file/d/1foE1JuZJeu5E_4qVge9kExzhvF32teuF/view
scifi_url = https://drive.google.com/file/d/13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75/view
"""

from google.colab import drive
drive.mount('/content/drive')

# Change the paths accordingly
# trip_advisor_path = '/content/drive/MyDrive/mlnlp1/exercise-2/data/tripadvisor_hotel_reviews_reduced.csv'
# scifi_path = '/content/drive/MyDrive/mlnlp1/exercise-2/data/scifi_reduced.txt'
trip_advisor_path = 'tripadvisor_hotel_reviews_reduced.csv'
scifi_path = 'scifi_reduced.txt'

# DF trip advisor
df_trip = pd.read_csv(trip_advisor_path)
print(df_trip.head())

# Scifi text
with open(scifi_path, "r") as text_file:
  scifi_text = text_file.read()

print("Scifi Text")
print(scifi_text[:1000])

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
                                              Review  Rating
0  fantastic service large hotel caters business ...       5
1  great hotel modern hotel good location, locate...       4
2  3 star plus glasgowjust got 30th november 4 da...       4
3  nice stayed hotel nov 19-23. great little bout...       4
4  great place wonderful hotel ideally located me...       5
Scifi Text
 A chat with the editor  i #  science fiction magazine called IF. The title was selected after much thought because of its brevity and on the theory it is indicative of the field and will be easy to remember. The tentative title that just morning and couldn't remember it until we'd had a cup of coffee, it was summarily discarded. A great deal of thought and effort lias gone into the formation of this magazine. We have had the aid of several very talented and generous people, for which we a

In [None]:
# Function to pre-process
# 1. Lowercase text
# 2. Tokenize based on sentences - split on "." - we'll get a list of sentences
# 3. For each sentence from 2, tokenize based on punctuations - nltk wordpunct_tokenize - we'll get a list of list of words in a sentence
# 4. Remove punctuations
def preprocess(text):
  text = text.lower()
  text_sent_token = text.split(".")
  text_punct_token = [ wordpunct_tokenize(sent) for sent in text_sent_token]
  text_punct_token_cleaned = []
  # Remove punctuations from tokenized list of lists.
  for txt_list in text_punct_token:
    clean_txt = []
    for txt in txt_list:
      if txt not in string.punctuation:
        clean_txt.append(txt)
    # Take only texts having more than 1 element
    if len(clean_txt)>1:
       text_punct_token_cleaned.append(clean_txt)
  return text_punct_token_cleaned
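
One detail worth checking in `preprocess`: `txt not in string.punctuation` is a substring test against the punctuation string, so a token made of several punctuation characters is removed only if that exact sequence happens to occur in `string.punctuation`. The sketch below compares it with a per-character test. It uses the regular expression `\w+|[^\w\s]+` as an assumed stand-in for `wordpunct_tokenize`, so that the example runs without NLTK; the helper names and the sample sentence are hypothetical.

```python
import re
import string


def tokenize(text):
    # Assumed stand-in for nltk's wordpunct_tokenize
    return re.findall(r"\w+|[^\w\s]+", text)


def is_punct_substring(tok):
    # The test used in preprocess above
    return tok in string.punctuation


def is_punct_all_chars(tok):
    # Alternative: every character of the token is punctuation
    return all(ch in string.punctuation for ch in tok)


tokens = tokenize("great hotel!! loved it (really), would return")
kept_substring = [t for t in tokens if not is_punct_substring(t)]
kept_all_chars = [t for t in tokens if not is_punct_all_chars(t)]
print(kept_substring)
print(kept_all_chars)
```

If you switch to the per-character test, expect the vocabulary and the sample counts to differ from the ones reported in this notebook.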


In [None]:
# Function to get vocab
# Input will be a list of lists
# Output - vocab set
def get_vocab(raw_llist):
  vocab = set()
  for lst in raw_llist:
    for el in lst:
      vocab.add(el)
  return vocab

# Function to get word-to-ix dictionary
def get_word2ix(vocab, save_loc):
  word_to_ix = {word: i for i, word in enumerate(vocab)}
  with open(save_loc, 'wb') as handle:
    pickle.dump(word_to_ix, handle, protocol=pickle.HIGHEST_PROTOCOL)
  return word_to_ix

In [None]:
# Function to generate tuples of context-target
# Input = list of lists
# Output = list of tuples (list of context_words, target)
def get_data(raw_llist, context_window=5):
  data = []
  for raw_list in raw_llist:
    # The range bounds guarantee a full window on both sides of position i,
    # so sentences shorter than 2*context_window + 1 tokens yield no samples
    for i in range(context_window, len(raw_list) - context_window):
      context = raw_list[i - context_window:i] + raw_list[i + 1:i + context_window + 1]
      target = raw_list[i]
      data.append((context, target))
  return data
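
Because the loop only visits positions with a full window on both sides, a sentence of `n` tokens contributes `max(0, n - 2 * context_window)` samples, and sentences shorter than `2 * context_window + 1` tokens contribute none. The sketch below checks this on made-up sentences; `count_samples` is a hypothetical helper, not part of the notebook.

```python
def count_samples(sentences, context_window):
    """Number of (context, target) pairs a full-window scheme yields."""
    return sum(max(0, len(sentence) - 2 * context_window) for sentence in sentences)


# Made-up tokenized sentences of lengths 4, 7 and 12
sentences = [
    "great hotel good location".split(),
    "we stayed here four nights in november".split(),
    "the room was clean and the staff were friendly and very helpful".split(),
]

for window in (2, 5):
    print(window, count_samples(sentences, window))
```

A larger window therefore yields fewer training samples from the same corpus, since more sentences fall below the minimum length and every remaining sentence loses positions at both ends.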

In [None]:
# Copy - Paste from previous code (for simplicity)

class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_dim=256):
      super(CBOW, self).__init__()
      # Embedding layer - Lookup table
      self.embeddings = nn.Embedding(vocab_size, embedding_dim)
      # Layer 1 - Since we'll be summing up the context vectors,
      # the input to this layer will be embedding_dim. Output is a hidden_dim-dimensional vector (256 by default)
      self.linear1 = nn.Linear(embedding_dim, hidden_dim)
      # Adding non-linearity through ReLU
      self.relu = nn.ReLU()
      # Final layer to get to vocab size dim
      self.linear2 = nn.Linear(hidden_dim, vocab_size)
      # Log softmax to get log-probabilities
      self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
      # inputs are context word indices. Get embeddings for them
      x = self.embeddings(inputs)
      # sum all context vectors
      x = torch.sum(x,axis=1).view(inputs.shape[0],-1)
      #x = sum(x).view(1,-1)
      # Add first layer
      x = self.linear1(x)
      # Add relu
      x = self.relu(x)
      # Add final layer
      x = self.linear2(x)
      # Get log softmax
      x = self.log_softmax(x)
      return x


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


# Function to get context vectors with batched data
def make_context_vector2(context_list, word_to_ix):
    idxs = []
    for context in context_list:
      idxs.append([word_to_ix[w] for w in context])
    return torch.tensor(idxs, dtype=torch.long)

# Function to get target (label) indices with batched data
def make_labels_idx(labels, word_to_ix):
    idxs = []
    for label in labels:
      idxs.append(word_to_ix[label])
    return torch.tensor(idxs, dtype=torch.long)


# Function to train the model - minor changes in arguments
def train(model, data, vocab, device, word_to_ix, NUM_EPOCHS=15):
  losses = []
  loss_function = nn.NLLLoss()
  model = model.to(device)
  optimizer = optim.SGD(model.parameters(), lr=0.001)
  X_batched, y_batched = generate_batches(data, batch_size=16)

  for epoch in range(NUM_EPOCHS):
      total_loss = 0
      for context, target in tqdm(zip(X_batched, y_batched), total=len(X_batched)):

          # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
          # into integer indices and wrap them in tensors)
          context_idxs = make_context_vector2(context, word_to_ix).to(device)
          target_idxs = make_labels_idx(target, word_to_ix).to(device)

          # Step 2. Recall that torch *accumulates* gradients. Before passing in a
          # new instance, you need to zero out the gradients from the old
          # instance
          model.zero_grad()

          # Step 3. Run the forward pass, getting log probabilities over next
          # words
          log_probs = model(context_idxs)

          # Step 4. Compute your loss function. (Again, Torch wants the target
          # word wrapped in a tensor)
          loss = loss_function(log_probs, target_idxs)

          # Step 5. Do the backward pass and update the parameters
          loss.backward()
          optimizer.step()

          # Get the Python number from a 1-element Tensor by calling tensor.item()
          total_loss += loss.item()

      print("Loss Epoch {ep} = {ls}".format(ep = epoch, ls = total_loss))
      losses.append(total_loss)
  print(losses)


### Trip-Advisor Training




In [None]:
# Preprocess
trip_pp = df_trip["Review"].apply(preprocess)

# Convert to list of lists
trip_pp_llist = []
trip_pp_rows = trip_pp.tolist()
for trip_pp_row in trip_pp_rows:
  for trip_pp_row_list in trip_pp_row:
    trip_pp_llist.append(trip_pp_row_list)

# Vocab
trip_vocab = get_vocab(trip_pp_llist)

# Word2Ix
trip_word2ix = get_word2ix(trip_vocab, '/content/drive/MyDrive/mlnlp1/exercise-2/trip_word2ix.pickle')

#### Context-window = 2

In [None]:
CONTEXT_WINDOW = 2
trip_data = get_data(trip_pp_llist, CONTEXT_WINDOW)

# Check data has correct shape
for data_sample in trip_data:
  assert(len(data_sample[0])==2*CONTEXT_WINDOW)

print("Number of samples: ", len(trip_data))


Number of samples:  973973


In [None]:
# Define the model
trip_model_window2 = CBOW(vocab_size = len(trip_vocab), embedding_dim=50)

# Device for GPU Training
device = torch.device("cpu")
if torch.cuda.is_available():
   print("Training on GPU")
   device = torch.device("cuda:0")

# Train
train(model = trip_model_window2, data=trip_data, vocab=trip_vocab, device=device, word_to_ix=trip_word2ix, NUM_EPOCHS=15)
torch.save(trip_model_window2, '/content/drive/MyDrive/mlnlp1/exercise-2/trip_model_window2.pth')


Training on GPU


100%|██████████| 60873/60873 [02:07<00:00, 478.62it/s]


Loss Epoch 0 = 537733.371711731


100%|██████████| 60873/60873 [02:01<00:00, 502.91it/s]


Loss Epoch 1 = 483126.46018075943


100%|██████████| 60873/60873 [02:05<00:00, 484.49it/s]


Loss Epoch 2 = 468764.9251241684


100%|██████████| 60873/60873 [02:05<00:00, 484.09it/s]


Loss Epoch 3 = 461329.4310104847


100%|██████████| 60873/60873 [01:59<00:00, 507.45it/s]


Loss Epoch 4 = 456403.26427435875


100%|██████████| 60873/60873 [01:59<00:00, 508.18it/s]


Loss Epoch 5 = 452727.00867414474


100%|██████████| 60873/60873 [01:59<00:00, 509.92it/s]


Loss Epoch 6 = 449790.8506155014


100%|██████████| 60873/60873 [01:59<00:00, 509.32it/s]


Loss Epoch 7 = 447334.53298544884


100%|██████████| 60873/60873 [01:59<00:00, 509.32it/s]


Loss Epoch 8 = 445208.526365757


100%|██████████| 60873/60873 [02:00<00:00, 507.05it/s]


Loss Epoch 9 = 443322.48322701454


100%|██████████| 60873/60873 [01:59<00:00, 509.77it/s]


Loss Epoch 10 = 441617.5093655586


100%|██████████| 60873/60873 [02:00<00:00, 506.63it/s]


Loss Epoch 11 = 440053.86068558693


100%|██████████| 60873/60873 [02:00<00:00, 503.56it/s]


Loss Epoch 12 = 438603.6401154995


100%|██████████| 60873/60873 [02:00<00:00, 506.36it/s]


Loss Epoch 13 = 437245.2623603344


100%|██████████| 60873/60873 [01:59<00:00, 508.31it/s]


Loss Epoch 14 = 435962.96954774857
[537733.371711731, 483126.46018075943, 468764.9251241684, 461329.4310104847, 456403.26427435875, 452727.00867414474, 449790.8506155014, 447334.53298544884, 445208.526365757, 443322.48322701454, 441617.5093655586, 440053.86068558693, 438603.6401154995, 437245.2623603344, 435962.96954774857]


In [None]:
trip_model_window2_loaded = torch.load('/content/drive/MyDrive/mlnlp1/exercise-2/trip_model_window2.pth')
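
`torch.save(model, ...)` pickles the whole module, so loading it needs the `CBOW` class to be defined first. In recent PyTorch releases, where `torch.load` defaults to `weights_only=True`, it also needs an explicit `weights_only=False`. A common alternative is to save only the `state_dict`. The sketch below is an illustration under assumptions, not the code that produced the checkpoints above: `TinyCBOW` is a hypothetical stand-in with similar layers, and an in-memory buffer replaces the Drive paths.

```python
import io

import torch
import torch.nn as nn


class TinyCBOW(nn.Module):
    """Hypothetical stand-in with layers similar to the CBOW class."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim=8):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs):
        x = self.embeddings(inputs).sum(dim=1)
        x = torch.relu(self.linear1(x))
        return torch.log_softmax(self.linear2(x), dim=-1)


model = TinyCBOW(vocab_size=10, embedding_dim=4)

# Save only the parameters (an in-memory buffer stands in for a .pth file).
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Rebuild the architecture, then load the parameters onto the CPU.
buffer.seek(0)
restored = TinyCBOW(vocab_size=10, embedding_dim=4)
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
restored.eval()
```

With this approach the checkpoint holds only tensors, and `map_location` lets a model trained on a GPU be restored on a CPU-only machine.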

#### Context-window = 5

In [None]:
CONTEXT_WINDOW = 5
trip_data_window5 = get_data(trip_pp_llist, CONTEXT_WINDOW)

# Check data has correct shape
for data_sample in trip_data_window5:
  assert(len(data_sample[0])==2*CONTEXT_WINDOW)

print("Number of samples: ", len(trip_data_window5))


Number of samples:  816667
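
The sample count is lower for window size 5 (816667) than for window size 2 (973973). One plausible explanation, assuming `get_data` only emits positions that have a full window on both sides (which the shape assertion above is consistent with), is that short sentences contribute fewer or no samples as the window grows. The function below is a hypothetical stand-in for `get_data`, written only to illustrate that effect on toy sentences.

```python
def build_cbow_samples(sentences, window):
    """Hypothetical stand-in for get_data: (context, target) pairs with a full window."""
    samples = []
    for sentence in sentences:
        # Only positions with `window` tokens on both sides yield a sample.
        for i in range(window, len(sentence) - window):
            context = sentence[i - window:i] + sentence[i + 1:i + window + 1]
            samples.append((context, sentence[i]))
    return samples


# Toy sentences, made up for illustration.
sentences = [
    ["great", "hotel", "friendly", "staff", "clean", "room"],  # 6 tokens
    ["nice", "stay"],                                          # 2 tokens
    ["room", "small", "but", "location", "great", "would", "stay",
     "again", "next", "year", "sure"],                         # 11 tokens
]

for window in (2, 5):
    samples = build_cbow_samples(sentences, window)
    # Same shape check as in the notebook.
    assert all(len(context) == 2 * window for context, _ in samples)
    print(window, len(samples))
```

On these toy sentences, window size 2 yields 9 samples and window size 5 yields only 1, because a sentence needs at least `2 * window + 1` tokens to produce any sample.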


In [None]:
# Define the model
trip_model_window5 = CBOW(vocab_size = len(trip_vocab), embedding_dim=50)

# Device for GPU Training
device = torch.device("cpu")
if torch.cuda.is_available():
   print("Training on GPU")
   device = torch.device("cuda:0")

# Train
train(model = trip_model_window5, data=trip_data_window5, vocab=trip_vocab, device=device, word_to_ix=trip_word2ix,
      NUM_EPOCHS=15)
# Save model
torch.save(trip_model_window5, '/content/drive/MyDrive/mlnlp1/exercise-2/trip_model_window5.pth')



Training on GPU


100%|██████████| 51041/51041 [01:41<00:00, 504.24it/s]


Loss Epoch 0 = 447444.58891153336


100%|██████████| 51041/51041 [01:43<00:00, 494.77it/s]


Loss Epoch 1 = 408827.3837146759


100%|██████████| 51041/51041 [01:41<00:00, 500.74it/s]


Loss Epoch 2 = 399027.7963979244


100%|██████████| 51041/51041 [01:41<00:00, 501.69it/s]


Loss Epoch 3 = 393605.24425554276


100%|██████████| 51041/51041 [01:41<00:00, 502.24it/s]


Loss Epoch 4 = 389891.3892633915


100%|██████████| 51041/51041 [01:41<00:00, 503.22it/s]


Loss Epoch 5 = 387067.2838578224


100%|██████████| 51041/51041 [01:42<00:00, 500.31it/s]


Loss Epoch 6 = 384777.1399919987


100%|██████████| 51041/51041 [01:41<00:00, 502.90it/s]


Loss Epoch 7 = 382838.22569322586


100%|██████████| 51041/51041 [01:41<00:00, 500.59it/s]


Loss Epoch 8 = 381145.7957429886


100%|██████████| 51041/51041 [01:42<00:00, 499.98it/s]


Loss Epoch 9 = 379634.40681552887


100%|██████████| 51041/51041 [01:41<00:00, 502.39it/s]


Loss Epoch 10 = 378260.5902414322


100%|██████████| 51041/51041 [01:42<00:00, 497.04it/s]


Loss Epoch 11 = 376993.6868059635


100%|██████████| 51041/51041 [01:40<00:00, 505.71it/s]


Loss Epoch 12 = 375812.3440673351


100%|██████████| 51041/51041 [01:42<00:00, 498.37it/s]


Loss Epoch 13 = 374700.0394477844


100%|██████████| 51041/51041 [01:40<00:00, 506.40it/s]


Loss Epoch 14 = 373644.72666954994
[447444.58891153336, 408827.3837146759, 399027.7963979244, 393605.24425554276, 389891.3892633915, 387067.2838578224, 384777.1399919987, 382838.22569322586, 381145.7957429886, 379634.40681552887, 378260.5902414322, 376993.6868059635, 375812.3440673351, 374700.0394477844, 373644.72666954994]


In [None]:
trip_model_window5_loaded = torch.load('/content/drive/MyDrive/mlnlp1/exercise-2/trip_model_window5.pth')

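- The total losses differ between window sizes 2 and 5, which suggests that the model's predictions are sensitive to the amount of context. Note that the totals are summed over different numbers of batches (60873 vs. 51041), so they should be compared per batch rather than as raw sums.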
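
Since `total_loss` accumulates `loss.item()` once per batch, dividing by the number of batches (the tqdm totals in the logs) gives a quantity that is comparable across runs. A minimal sketch using the final-epoch values printed above follows. Whether each batch loss is itself a mean over samples depends on how `loss_function` was constructed, which is not shown in this section.

```python
# Final-epoch totals and batch counts copied from the training logs above.
runs = {
    "trip, window 2": (435962.96954774857, 60873),
    "trip, window 5": (373644.72666954994, 51041),
}

# Dividing by the number of batches removes the dependence on dataset size.
per_batch = {name: total / n_batches for name, (total, n_batches) in runs.items()}

for name, value in per_batch.items():
    print(f"{name}: {value:.4f}")
```

On this measure the window-5 model ends with a slightly higher average loss per batch (about 7.32 vs. 7.16), even though its raw total is lower.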

### Scifi Training

#### Context-window = 2

In [None]:
CONTEXT_WINDOW = 2

scifi_preprocessed = preprocess(scifi_text)
scifi_vocab = get_vocab(scifi_preprocessed)
scifi_word2ix = get_word2ix(scifi_vocab, '/content/drive/MyDrive/mlnlp1/exercise-2/scifi_word2ix.pickle')
scifi_data = get_data(scifi_preprocessed, CONTEXT_WINDOW)
print("Number of samples: ", len(scifi_data))

# Check data has correct shape
for data_sample in scifi_data:
  assert(len(data_sample[0])==2*CONTEXT_WINDOW)

Number of samples:  5749466


In [None]:
# Define the model
scifi_model_window2 = CBOW(vocab_size = len(scifi_vocab), embedding_dim=50)

# Device for GPU Training
device = torch.device("cpu")
if torch.cuda.is_available():
   print("Training on GPU")
   device = torch.device("cuda:0")

# Train
train(model = scifi_model_window2, data=scifi_data, vocab=scifi_vocab, device=device, word_to_ix=scifi_word2ix, NUM_EPOCHS=3)
torch.save(scifi_model_window2, '/content/drive/MyDrive/mlnlp1/exercise-2/scifi_model_window2.pth')


Training on GPU


100%|██████████| 359341/359341 [28:21<00:00, 211.18it/s]


Loss Epoch 0 = 2635441.4196851254


100%|██████████| 359341/359341 [28:20<00:00, 211.33it/s]


Loss Epoch 1 = 2420369.2964789867


100%|██████████| 359341/359341 [28:21<00:00, 211.25it/s]


Loss Epoch 2 = 2369389.719114661
[2635441.4196851254, 2420369.2964789867, 2369389.719114661]


In [None]:
scifi_model_window2_loaded = torch.load('/content/drive/MyDrive/mlnlp1/exercise-2/scifi_model_window2.pth')

#### Context-window = 5

In [None]:
CONTEXT_WINDOW = 5
scifi_data = get_data(scifi_preprocessed, CONTEXT_WINDOW)
print("Number of samples: ", len(scifi_data))

# Check data has correct shape
for data_sample in scifi_data:
  assert(len(data_sample[0])==2*CONTEXT_WINDOW)

Number of samples:  3264696


In [None]:
# Define the model
scifi_model_window5 = CBOW(vocab_size = len(scifi_vocab), embedding_dim=50)

# Device for GPU Training
device = torch.device("cpu")
if torch.cuda.is_available():
   print("Training on GPU")
   device = torch.device("cuda:0")

# Train
train(model = scifi_model_window5, data=scifi_data, vocab=scifi_vocab, device=device, word_to_ix=scifi_word2ix, NUM_EPOCHS=3)
torch.save(scifi_model_window5, '/content/drive/MyDrive/mlnlp1/exercise-2/scifi_model_window5.pth')


Training on GPU


100%|██████████| 204043/204043 [16:10<00:00, 210.15it/s]


Loss Epoch 0 = 1549487.308421135


100%|██████████| 204043/204043 [16:10<00:00, 210.24it/s]


Loss Epoch 1 = 1448604.5771062374


100%|██████████| 204043/204043 [16:10<00:00, 210.17it/s]


Loss Epoch 2 = 1426921.0356652737
[1549487.308421135, 1448604.5771062374, 1426921.0356652737]


In [None]:
scifi_model_window5_loaded = torch.load('/content/drive/MyDrive/mlnlp1/exercise-2/scifi_model_window5.pth')

## Part 2 - Testing

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pickle
torch.manual_seed(1)

import pandas as pd
import nltk
nltk.download("punkt")
from nltk.tokenize import wordpunct_tokenize, sent_tokenize
import string

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
## Load data

"""
trip_advisor_url = https://drive.google.com/file/d/1foE1JuZJeu5E_4qVge9kExzhvF32teuF/view
scifi_url = https://drive.google.com/file/d/13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75/view
"""

from google.colab import drive
drive.mount('/content/drive')

# Change the paths accordingly
# trip_advisor_path = '/content/drive/MyDrive/mlnlp1/exercise-2/data/tripadvisor_hotel_reviews_reduced.csv'
# scifi_path = '/content/drive/MyDrive/mlnlp1/exercise-2/data/scifi_reduced.txt'
trip_advisor_path = 'tripadvisor_hotel_reviews_reduced.csv'
scifi_path = 'scifi_reduced.txt'

# DF trip advisor
df_trip = pd.read_csv(trip_advisor_path)
print(df_trip.head())

# Scifi text
with open(scifi_path, "r") as text_file:
  scifi_text = text_file.read()



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
                                              Review  Rating
0  fantastic service large hotel caters business ...       5
1  great hotel modern hotel good location, locate...       4
2  3 star plus glasgowjust got 30th november 4 da...       4
3  nice stayed hotel nov 19-23. great little bout...       4
4  great place wonderful hotel ideally located me...       5


In [4]:
# Function to pre-process
# 1. Lowercase the text
# 2. Split into sentences on "." - gives a list of sentences
# 3. Tokenize each sentence with nltk's wordpunct_tokenize - gives a list of token lists, one per sentence
# 4. Remove punctuation tokens and drop sentences with fewer than 2 tokens
def preprocess(text):
  text = text.lower()
  text_sent_token = text.split(".")
  text_punct_token = [ wordpunct_tokenize(sent) for sent in text_sent_token]
  text_punct_token_cleaned = []
  # Remove punctuation tokens from the tokenized list of lists.
  for txt_list in text_punct_token:
    clean_txt = []
    for txt in txt_list:
      if txt not in string.punctuation:
        clean_txt.append(txt)
    # Take only texts having more than 1 element
    if len(clean_txt)>1:
       text_punct_token_cleaned.append(clean_txt)
  return text_punct_token_cleaned
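
Note that `txt not in string.punctuation` is a substring test against a single string. A multi-character token such as `?"` is therefore kept unless it happens to appear contiguously in `string.punctuation`, which is consistent with `?"` showing up among the sci-fi neighbours further down. The sketch below illustrates the behaviour. It uses a regular expression assumed to mirror `wordpunct_tokenize` (`\w+|[^\w\s]+`) so that it runs without NLTK, and `is_punct` is a hypothetical helper that is not part of the notebook.

```python
import re
import string

# Assumed to mirror nltk's wordpunct_tokenize: runs of word characters
# or runs of non-word, non-space characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def is_punct(token):
    """Hypothetical helper: True if every character is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


tokens = TOKEN_PATTERN.findall('he said "why?" and left')

# Filter used in preprocess(): substring test against string.punctuation.
kept_substring = [t for t in tokens if t not in string.punctuation]

# Stricter per-character filter.
kept_per_char = [t for t in tokens if not is_punct(t)]

print(tokens)
print(kept_substring)
print(kept_per_char)
```

Switching to the per-character filter would change the vocabulary, so the `word2ix` dictionaries and the models would have to be rebuilt with the same preprocessing.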


In [5]:
## Required for loading the model

class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_dim=256):
      super(CBOW, self).__init__()
      # Embedding layer - Lookup table
      self.embeddings = nn.Embedding(vocab_size, embedding_dim)
      # Layer 1 - Since we'll be summing up the context vectors, the input to this layer will be embedding_dim. Output is a 256 dim. vector
      self.linear1 = nn.Linear(embedding_dim, hidden_dim)
      # Adding non-linearity through ReLU
      self.relu = nn.ReLU()
      # Final layer to get to vocab size dim
      self.linear2 = nn.Linear(hidden_dim, vocab_size)
      # Log softmax to get log-probabilities
      self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
      # inputs are context vectors. Get embeddings for them
      x = self.embeddings(inputs)
      # sum all context vectors
      x = torch.sum(x,axis=1).view(inputs.shape[0],-1)
      #x = sum(x).view(1,-1)
      # Add first layer
      x = self.linear1(x)
      # Add relu
      x = self.relu(x)
      # Add final layer
      x = self.linear2(x)
      # Get log softmax
      x = self.log_softmax(x)
      return x
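
To make the shapes in `forward` concrete, here is a NumPy restatement of the same computation on a toy batch. The sizes and random weights are made up for illustration; only the sequence of operations (lookup, sum over the context axis, linear, ReLU, linear, log-softmax) follows the class above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, hidden_dim = 12, 5, 7   # toy sizes, not the notebook's
batch_size, context_len = 3, 4                     # context_len = 2 * CONTEXT_WINDOW

E = rng.normal(size=(vocab_size, embedding_dim))   # embedding table
# Weights are stored as (in, out) here; nn.Linear stores the transpose.
W1 = rng.normal(size=(embedding_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(hidden_dim, vocab_size))
b2 = np.zeros(vocab_size)

inputs = rng.integers(0, vocab_size, size=(batch_size, context_len))

x = E[inputs]                    # (batch, context_len, embedding_dim)
x = x.sum(axis=1)                # (batch, embedding_dim)
h = np.maximum(x @ W1 + b1, 0)   # (batch, hidden_dim), ReLU
logits = h @ W2 + b2             # (batch, vocab_size)

# Numerically stable log-softmax over the vocabulary axis.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

print(x.shape, h.shape, log_probs.shape)
```

Because the context embeddings are summed, the input to the first linear layer has size `embedding_dim` regardless of the window size.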


In [6]:
### Load all models
trip_model_window2_loaded = torch.load('/content/drive/MyDrive/mlnlp1/exercise-2/trip_model_window2.pth')
trip_model_window5_loaded = torch.load('/content/drive/MyDrive/mlnlp1/exercise-2/trip_model_window5.pth')
scifi_model_window2_loaded = torch.load('/content/drive/MyDrive/mlnlp1/exercise-2/scifi_model_window2.pth')
scifi_model_window5_loaded = torch.load('/content/drive/MyDrive/mlnlp1/exercise-2/scifi_model_window5.pth')

### Load word2ix dictionaries
with open('/content/drive/MyDrive/mlnlp1/exercise-2/trip_word2ix.pickle', 'rb') as handle:
    trip_word2ix = pickle.load(handle)

with open('/content/drive/MyDrive/mlnlp1/exercise-2/scifi_word2ix.pickle', 'rb') as handle:
    scifi_word2ix = pickle.load(handle)


In [7]:
### Function to get pretrained numpy embeddings for each word
def get_pretrained_embeddings(model, word2idx):
  # Use the device the model's weights live on, so this also works without a GPU
  device = model.embeddings.weight.device

  all_idx = torch.arange(len(word2idx), dtype=torch.long, device=device)
  pretrained_embeds = model.embeddings(all_idx)
  pretrained_embeds = pretrained_embeds.detach().cpu().numpy()

  return pretrained_embeds

### Function to get idx to word dictionary
def get_idx2word(word2idx):
  idx2word = dict()
  for word, idx in word2idx.items():
    idx2word[idx] = word
  assert(len(idx2word)==len(word2idx))
  return idx2word
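
Looking up every index in order returns the embedding matrix itself, so the same array can also be read from the layer's `weight` attribute. A small check with a toy `nn.Embedding` (sizes chosen arbitrarily for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embeddings = nn.Embedding(6, 3)  # toy table: 6 words, 3 dimensions

# Lookup of all indices in order, as in get_pretrained_embeddings.
all_idx = torch.arange(embeddings.num_embeddings, dtype=torch.long)
via_lookup = embeddings(all_idx).detach().cpu().numpy()

# Direct read of the weight matrix.
via_weight = embeddings.weight.detach().cpu().numpy()

print(via_lookup.shape)
```

Both arrays have one row per vocabulary index, which is why row `i` can be paired with `idx2word[i]`.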

In [8]:
## Get idx2word dictionaries and pre-trained embeddings for all models
trip_idx2word = get_idx2word(trip_word2ix)
trip_embeds_window2 = get_pretrained_embeddings(trip_model_window2_loaded, trip_word2ix)
trip_embeds_window5 = get_pretrained_embeddings(trip_model_window5_loaded, trip_word2ix)
print("Shape of trip embeddings: Window = 2", trip_embeds_window2.shape)
print("Shape of trip embeddings: Window = 5", trip_embeds_window5.shape)

scifi_idx2word = get_idx2word(scifi_word2ix)
scifi_embeds_window2 = get_pretrained_embeddings(scifi_model_window2_loaded, scifi_word2ix)
scifi_embeds_window5 = get_pretrained_embeddings(scifi_model_window5_loaded, scifi_word2ix)
print("Shape of scifi embeddings: Window = 2", scifi_embeds_window2.shape)
print("Shape of scifi embeddings: Window = 5", scifi_embeds_window5.shape)



Shape of trip embeddings: Window = 2 (36894, 50)
Shape of trip embeddings: Window = 5 (36894, 50)
Shape of scifi embeddings: Window = 2 (111643, 50)
Shape of scifi embeddings: Window = 5 (111643, 50)


In [9]:
## Function to get frequency of words in vocab
def get_vocab_freq(raw_llist):
  vocab_freq = dict()
  for lst in raw_llist:
    for el in lst:
      if el in vocab_freq:
        vocab_freq[el] += 1
      else:
        vocab_freq[el] = 1
  return dict(sorted(vocab_freq.items(), key=lambda kv: kv[1], reverse=True))
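
The same counts can be obtained with `collections.Counter`. The sketch below is a hypothetical variant (`get_vocab_freq_counter` is not part of the notebook) shown on toy sentences.

```python
from collections import Counter
from itertools import chain


def get_vocab_freq_counter(raw_llist):
    """Hypothetical variant of get_vocab_freq built on collections.Counter."""
    counts = Counter(chain.from_iterable(raw_llist))
    # most_common() sorts by count, highest first.
    return dict(counts.most_common())


# Toy sentences, made up for illustration.
sentences = [["great", "hotel", "great", "staff"], ["hotel", "great"]]
freq = get_vocab_freq_counter(sentences)
print(freq)
```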

In [10]:
## Function to get closest words
## Closest word should be itself - sanity check
## Select closest words having comparable frequency with the input word
## vocab_freq stores the frequency of words in vocabulary
def get_closest_word(word, word_to_index, index_to_word, emb, vocab_freq, freq_thresh=0.25, topn=11, use_freq=False):
  word_distance = []
  pdist = nn.PairwiseDistance()
  i = word_to_index[word]
  v_i = emb[i]
  for j in range(len(word_to_index)):
    word_j = index_to_word[j]
    if use_freq:
      if vocab_freq[word_j] >= freq_thresh*vocab_freq[word]:
        v_j = emb[j]
        word_distance.append((index_to_word[j], float(pdist(v_i, v_j))))
    else:
      v_j = emb[j]
      word_distance.append((index_to_word[j], float(pdist(v_i, v_j))))
  word_distance.sort(key=lambda x: x[1])
  return word_distance[:topn]
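
The loop above calls `pdist` once per vocabulary entry. The same ranking can be computed in one vectorized step, sketched below with NumPy on a toy embedding matrix. `closest_words_vectorized` is a hypothetical name. It uses the plain Euclidean norm, so the self-distance is exactly 0 rather than the roughly 7e-06 reported by `nn.PairwiseDistance`, which adds a small `eps` to the difference.

```python
import numpy as np


def closest_words_vectorized(word, word_to_index, index_to_word, emb,
                             vocab_freq=None, freq_thresh=0.25, topn=11):
    """Hypothetical vectorized variant of get_closest_word (Euclidean distance)."""
    i = word_to_index[word]
    # Distance from the query vector to every row of the embedding matrix.
    dists = np.linalg.norm(emb - emb[i], axis=1)
    order = np.argsort(dists, kind="stable")
    results = []
    for j in order:
        candidate = index_to_word[int(j)]
        # Optional filter: keep words at least freq_thresh times as frequent.
        if vocab_freq is not None and vocab_freq[candidate] < freq_thresh * vocab_freq[word]:
            continue
        results.append((candidate, float(dists[j])))
        if len(results) == topn:
            break
    return results


# Toy data, made up for illustration.
words = ["hotel", "room", "staff", "daythere"]
word_to_index = {w: k for k, w in enumerate(words)}
index_to_word = {k: w for w, k in word_to_index.items()}
emb = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])
vocab_freq = {"hotel": 100, "room": 80, "staff": 60, "daythere": 1}

no_filter = closest_words_vectorized("hotel", word_to_index, index_to_word, emb, topn=3)
with_filter = closest_words_vectorized("hotel", word_to_index, index_to_word, emb,
                                       vocab_freq=vocab_freq, topn=3)
print(no_filter)
print(with_filter)
```

In the toy data the rare word `daythere` is the nearest neighbour without the filter and is dropped once the frequency threshold is applied, which mirrors the pattern in the outputs below.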

#### Testing Trip-Advisor Embeddings

In [11]:
## Get vocab frequency of trip data

# Preprocess
trip_pp = df_trip["Review"].apply(preprocess)

# Convert to list of lists
trip_pp_llist = []
trip_pp_rows = trip_pp.tolist()
for trip_pp_row in trip_pp_rows:
  for trip_pp_row_list in trip_pp_row:
    trip_pp_llist.append(trip_pp_row_list)

trip_vocab_freq = get_vocab_freq(trip_pp_llist)


## Get vocab frequency of scifi data
scifi_preprocessed = preprocess(scifi_text)
scifi_vocab_freq = get_vocab_freq(scifi_preprocessed)

In [12]:
## Get frequency distribution to select 3 nouns, 3 verbs and 3 adjectives

## Selecting nouns, verbs and adjectives from trip_vocab_freq (the last word in each list occurs infrequently)
trip_nouns = ['hotel','room', 'staff', 'bridge'] # bridge occurs infrequently
trip_verbs = ['booked', 'arrived', 'finding'] # finding occurs infrequently
trip_adj = ['great','helpful','majestic'] # majestic occurs infrequently
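
One way to check that a chosen word is frequent or infrequent is to look up its count and rank in the sorted frequency dictionary. The helper below is hypothetical and runs on a toy dictionary; it assumes the dictionary is already sorted by descending count, as `get_vocab_freq` returns it.

```python
def frequency_report(words, vocab_freq):
    """Hypothetical helper: (word, count, rank) per word, rank 1 = most frequent."""
    # Relies on vocab_freq being ordered from most to least frequent.
    ranks = {w: r for r, w in enumerate(vocab_freq, start=1)}
    return [(w, vocab_freq[w], ranks[w]) for w in words]


# Toy counts, made up for illustration.
toy_vocab_freq = {"hotel": 120, "room": 90, "staff": 40, "bridge": 2}
report = frequency_report(["hotel", "bridge"], toy_vocab_freq)
print(report)
```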

In [13]:
## Get top n closest words for nouns
for noun in trip_nouns:
  print("Current Word: ", noun)
  topn_closest_words_window2 = get_closest_word(noun, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window2), trip_vocab_freq, topn=6)
  print("Top words using context window 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(noun, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window5), trip_vocab_freq, topn=6)
  print("Top words using context window 5: \n", topn_closest_words_window5)
  print("\n")

  print("================== Using Frequency Thresholds =====================")
  topn_closest_words_window2 = get_closest_word(noun, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window2), trip_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(noun, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window5), trip_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 5: \n", topn_closest_words_window5)
  print("\n")

Current Word:  hotel
Top words using context window 2: 
 [('hotel', 7.071067557262722e-06), ('daythere', 7.1150407791137695), ('fakes', 7.34874153137207), ('antibacterial', 7.442863941192627), ('averge', 7.481241226196289), ('ptns', 7.485942363739014)]
Top words using context window 5: 
 [('hotel', 7.071067557262722e-06), ('smirnoff', 6.539915084838867), ('you', 6.664706707000732), ('centeredness', 6.683339595794678), ('tallers', 6.783373832702637), ('frescos', 6.785723686218262)]


Top words using frequency thresholds - context window = 2: 
 [('hotel', 7.071067557262722e-06), ('nice', 10.170254707336426), ('room', 10.289427757263184), ('not', 10.764053344726562), ('stay', 10.84562873840332), ('great', 11.0220308303833), ('good', 11.04296588897705), ('n', 11.058713912963867), ('just', 11.50808334350586), ('did', 11.63641357421875), ('staff', 11.810798645019531)]
Top words using frequency thresholds - context window = 5: 
 [('hotel', 7.071067557262722e-06), ('nice', 8.735823631286621), 

In [14]:
## Get top n closest words for verbs
for verb in trip_verbs:
  print("Current Word: ", verb)
  topn_closest_words_window2 = get_closest_word(verb, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window2), trip_vocab_freq, topn=6)
  print("Top words using context window 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(verb, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window5), trip_vocab_freq, topn=6)
  print("Top words using context window 5: \n", topn_closest_words_window5)
  print("\n")

  print("================== Using Frequency Thresholds =====================")
  topn_closest_words_window2 = get_closest_word(verb, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window2), trip_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(verb, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window5), trip_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 5: \n", topn_closest_words_window5)
  print("\n")

Current Word:  booked
Top words using context window 2: 
 [('booked', 7.071067557262722e-06), ('degrading', 7.154239177703857), ('towards', 7.2299885749816895), ('bordeux', 7.316226959228516), ('theregood', 7.339205741882324), ('thorogh', 7.373607635498047)]
Top words using context window 5: 
 [('booked', 7.071067557262722e-06), ('mil', 7.586578369140625), ('breeding', 7.60408353805542), ('peaked', 8.02645206451416), ('becausewe', 8.053140640258789), ('soilded', 8.111804008483887)]


Top words using frequency thresholds - context window = 2: 
 [('booked', 7.071067557262722e-06), ('business', 8.284531593322754), ('book', 8.403924942016602), ('restaurants', 8.440139770507812), ('prices', 8.480257987976074), ('bring', 8.5995454788208), ('yes', 8.767535209655762), ('large', 8.84104061126709), ('2', 8.878698348999023), ('convenient', 8.880636215209961), ('line', 8.884795188903809)]
Top words using frequency thresholds - context window = 5: 
 [('booked', 7.071067557262722e-06), ('use', 8.752

In [15]:
## Get top n closest words for adjectives
for adj in trip_adj:
  print("Current Word: ", adj)
  topn_closest_words_window2 = get_closest_word(adj, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window2), trip_vocab_freq, topn=6)
  print("Top words using context window 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(adj, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window5), trip_vocab_freq, topn=6)
  print("Top words using context window 5: \n", topn_closest_words_window5)
  print("\n")

  print("================== Using Frequency Thresholds =====================")
  topn_closest_words_window2 = get_closest_word(adj, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window2), trip_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(adj, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window5), trip_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 5: \n", topn_closest_words_window5)
  print("\n")

Current Word:  great
Top words using context window 2: 
 [('great', 7.071067557262722e-06), ('sooooooo', 7.483932018280029), ('maintainence', 7.635804653167725), ('substituted', 7.64486837387085), ('corner', 7.653206825256348), ('dispassionate', 7.690426826477051)]
Top words using context window 5: 
 [('great', 7.071067557262722e-06), ('tore', 7.171984672546387), ('detour', 7.46504020690918), ('harmful', 7.708715438842773), ('lagoon', 7.713411808013916), ('scubaed', 7.7191596031188965)]


Top words using frequency thresholds - context window = 2: 
 [('great', 7.071067557262722e-06), ('stay', 9.406214714050293), ('helpful', 9.57632064819336), ('clean', 9.777229309082031), ('time', 9.860251426696777), ('nice', 9.872404098510742), ('night', 9.90973949432373), ('people', 9.919472694396973), ('best', 10.043133735656738), ('beach', 10.0620698928833), ('small', 10.094267845153809)]
Top words using frequency thresholds - context window = 5: 
 [('great', 7.071067557262722e-06), ('place', 9.2109

#### Testing Scifi Embeddings

In [16]:

## Selecting 3 nouns, verbs and adjectives from scifi_vocab_freq
scifi_nouns = ['man', 'eyes', 'president'] # president occurs infrequently
scifi_verbs = ['said', 'looked', 'eat'] # eat occurs infrequently
scifi_adj = ['good', 'old', 'poor'] # poor occurs infrequently

In [17]:
## Get top n closest words for nouns
for noun in scifi_nouns:
  print("Current Word: ", noun)
  topn_closest_words_window2 = get_closest_word(noun, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window2), scifi_vocab_freq, topn=6)
  print("Top words using context window 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(noun, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window5), scifi_vocab_freq, topn=6)
  print("Top words using context window 5: \n", topn_closest_words_window5)
  print("\n")

  print("================== Using Frequency Thresholds =====================")
  topn_closest_words_window2 = get_closest_word(noun, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window2), scifi_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(noun, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window5), scifi_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 5: \n", topn_closest_words_window5)
  print("\n")

Current Word:  man
Top words using context window 2: 
 [('man', 7.071067557262722e-06), ('ytccuuough', 6.1966753005981445), ('loiew', 6.429879665374756), ('ofmush', 6.569863796234131), ('straps', 6.574796676635742), ('helmet', 6.598606109619141)]
Top words using context window 5: 
 [('man', 7.071067557262722e-06), ('hejd', 6.230587959289551), ('headstart', 6.558587551116943), ('matriarch', 6.641236305236816), ('looung', 6.741540908813477), ('triumphantly', 6.791837692260742)]


Top words using frequency thresholds - context window = 2: 
 [('man', 7.071067557262722e-06), ('without', 7.140392780303955), ('still', 7.851052284240723), ('where', 8.03598690032959), ('another', 8.102642059326172), ('mind', 8.128649711608887), ('first', 8.157269477844238), ('no', 8.287017822265625), ('again', 8.295594215393066), ('my', 8.33771800994873), ('let', 8.364876747131348)]
Top words using frequency thresholds - context window = 5: 
 [('man', 7.071067557262722e-06), ('around', 8.008111000061035), ('abo

In [18]:
## Get top n closest words for verbs
for verb in scifi_verbs:
  print("Current Word: ", verb)
  topn_closest_words_window2 = get_closest_word(verb, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window2), scifi_vocab_freq, topn=6)
  print("Top words using context window 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(verb, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window5), scifi_vocab_freq, topn=6)
  print("Top words using context window 5: \n", topn_closest_words_window5)
  print("\n")

  print("================== Using Frequency Thresholds =====================")
  topn_closest_words_window2 = get_closest_word(verb, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window2), scifi_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(verb, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window5), scifi_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 5: \n", topn_closest_words_window5)
  print("\n")

Current Word:  said
Top words using context window 2: 
 [('said', 7.071067557262722e-06), ('psychopathological', 7.161562442779541), ('teleportations', 7.164734363555908), ('recordsi', 7.181768417358398), ('walkdown', 7.1959943771362305), ('tepeni', 7.211658000946045)]
Top words using context window 5: 
 [('said', 7.071067557262722e-06), ('drfflif', 6.66356086730957), ('urigiaaf', 6.882411956787109), ('archaeologists', 6.93709135055542), ('beluthahatchie', 6.970693111419678), ('jmfe', 7.1472649574279785)]


Top words using frequency thresholds - context window = 2: 
 [('said', 7.071067557262722e-06), ('this', 8.031986236572266), ('here', 8.151091575622559), ('through', 8.893613815307617), ('s', 8.981287956237793), ('?"', 9.002523422241211), ('it', 9.012069702148438), ('even', 9.063409805297852), ('how', 9.14681625366211), ('so', 9.1686429977417), ('are', 9.196456909179688)]
Top words using frequency thresholds - context window = 5: 
 [('said', 7.071067557262722e-06), ('don', 8.41417407

In [19]:
## Get top n closest words for adjectives
for adj in scifi_adj:
  print("Current Word: ", adj)
  topn_closest_words_window2 = get_closest_word(adj, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window2), scifi_vocab_freq, topn=6)
  print("Top words using context window 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(adj, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window5), scifi_vocab_freq, topn=6)
  print("Top words using context window 5: \n", topn_closest_words_window5)
  print("\n")

  print("================== Using Frequency Thresholds =====================")
  topn_closest_words_window2 = get_closest_word(adj, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window2), scifi_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 2: \n", topn_closest_words_window2)
  topn_closest_words_window5 = get_closest_word(adj, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window5), scifi_vocab_freq, use_freq=True)
  print("Top words using frequency thresholds - context window = 5: \n", topn_closest_words_window5)
  print("\n")

Current Word:  good
Top words using context window 2: 
 [('good', 7.071067557262722e-06), ('skunki', 6.017290115356445), ('philosopher', 6.096534252166748), ('transfusion', 6.340238571166992), ('pfestige', 6.363999843597412), ('thirtyfifth', 6.375101566314697)]
Top words using context window 5: 
 [('good', 7.071067557262722e-06), ('sfmagazine', 6.839204788208008), ('hurrah', 6.955796718597412), ('eightyfive', 7.0804643630981445), ('emulation', 7.096042156219482), ('whisper', 7.142617702484131)]


Top words using frequency thresholds - context window = 2: 
 [('good', 7.071067557262722e-06), ('feel', 6.788490295410156), ('except', 7.207240104675293), ('something', 7.243300437927246), ('began', 7.376669406890869), ('yet', 7.458994388580322), ('hours', 7.4830002784729), ('going', 7.495906352996826), ('open', 7.546849250793457), ('at', 7.628347396850586), ('got', 7.764412879943848)]
Top words using frequency thresholds - context window = 5: 
 [('good', 7.071067557262722e-06), ('around', 8.3

#### Common words in both the datasets
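
Candidate words for this comparison must exist in both vocabularies. A minimal sketch of finding them from the two `word2ix` dictionaries is shown below, using toy dictionaries in place of `trip_word2ix` and `scifi_word2ix`; `common_vocab` is a hypothetical helper.

```python
def common_vocab(word2ix_a, word2ix_b):
    """Hypothetical helper: sorted list of words present in both vocabularies."""
    return sorted(set(word2ix_a) & set(word2ix_b))


# Toy dictionaries standing in for trip_word2ix and scifi_word2ix.
toy_trip_word2ix = {"hotel": 0, "festive": 1, "incredible": 2}
toy_scifi_word2ix = {"planet": 0, "incredible": 1, "festive": 2}

shared = common_vocab(toy_trip_word2ix, toy_scifi_word2ix)
print(shared)
```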

In [20]:
## Get common vocab
#common_vocab = trip_vocab.intersection(scifi_vocab)
common_words = ["festive","incredible"]

for word in common_words:
  print("Current Word: ", word)
  top5_scifi_window2 = get_closest_word(word, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window2),
                                                scifi_vocab_freq, use_freq=True)
  print("Top words using context window 2 on Sci-fi Dataset: \n", top5_scifi_window2)

  top5_hotel_window2 = get_closest_word(word, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window2),
                                                trip_vocab_freq, use_freq=True)
  print("Top words using context window 2 on Trip Advisor: \n", top5_hotel_window2)

  top5_scifi_window5 = get_closest_word(word, scifi_word2ix, scifi_idx2word,
                                                torch.tensor(scifi_embeds_window5),
                                                scifi_vocab_freq, use_freq=True)
  print("Top words using context window 5 on Sci-fi Dataset: \n", top5_scifi_window5)


  top5_hotel_window5 = get_closest_word(word, trip_word2ix, trip_idx2word,
                                                torch.tensor(trip_embeds_window5),
                                                trip_vocab_freq, use_freq=True)
  print("Top words using context window 5 on Trip Advisor: \n", top5_hotel_window5)

  print("\n")

Current Word:  festive
Top words using context window 2 on Sci-fi Dataset: 
 [('festive', 7.071067557262722e-06), ('f', 6.074956893920898), ('macnessa', 6.1809821128845215), ('waxen', 6.191169261932373), ('dreadnaughts', 6.225739479064941), ('somethink', 6.276749610900879), ('drivel', 6.383573055267334), ('lifelessness', 6.413447380065918), ('corbacco', 6.429171085357666), ('structures', 6.483883380889893), ('rex', 6.485265731811523)]
Top words using context window 2 on Trip Advisor: 
 [('festive', 7.071067557262722e-06), ('itchy', 5.819334983825684), ('briefing', 5.927561283111572), ('argumentative', 6.133767127990723), ('definelty', 6.149537563323975), ('infallible', 6.186091899871826), ('brains', 6.272236347198486), ('swabs', 6.290041446685791), ('venezuela', 6.337795734405518), ('loaction', 6.351245403289795), ('gang', 6.364889144897461)]
Top words using context window 5 on Sci-fi Dataset: 
 [('festive', 7.071067557262722e-06), ('sluiced', 5.435795783996582), ('microcircuits', 5.55