<a href="https://colab.research.google.com/github/saphjra/atmt_2024/blob/main/current_Exercise_2_Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%matplotlib inline

## Start with this cell to load data and skip training


### Source: [link](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what
the word *is*, it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you
combine these representations? We often want dense outputs from our
neural networks, where the inputs are $|V|$ dimensional, where
$V$ is our vocabulary, but often the outputs are only a few
dimensional (if we are only predicting a handful of labels, for
instance). How do we get from a massive dimensional space to a smaller
dimensional space?

How about instead of ASCII representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.
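
To make the encoding concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the helper names `one_hot` and `dot` are assumptions chosen for illustration; they are not part of the exercise code.

```python
# Toy vocabulary (hypothetical; any fixed word list works the same way).
vocab = ["mathematician", "physicist", "store", "problem"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-element list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("physicist"))
# Distinct words never share a non-zero position, so their dot product is 0.
print(dot(one_hot("physicist"), one_hot("mathematician")))
# A word compared with itself gives 1.
print(dot(one_hot("physicist"), one_hot("physicist")))
```

Every pair of distinct words ends up with a dot product of 0, regardless of what the words mean.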

There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the
[distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).


# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.
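
As a rough numerical check, the two similarity measures can be computed in plain Python. This sketch uses only the three coordinates written out above (the entries hidden behind the dots are unknown), and the helper names are assumptions for illustration.

```python
import math

# First three coordinates of the hand-crafted vectors above; the trailing
# "..." entries are unknown, so this is only an illustration.
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]


def dot(u, v):
    """Unnormalized similarity: the dot product."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


print(dot(q_physicist, q_mathematician))
print(cosine_similarity(q_physicist, q_mathematician))
```

The normalized value always lies between -1 and 1, which makes similarities comparable across words whose vectors have very different lengths.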


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each word basically has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their
entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


# Word Embeddings in PyTorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in PyTorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use torch.LongTensor (since the
indices are integers, not floats).
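
A minimal sketch of such a lookup, assuming a hypothetical two-word vocabulary and 5-dimensional embeddings (both chosen only for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Hypothetical two-word vocabulary, for illustration only.
word_to_ix = {"hello": 0, "world": 1}

# 2 words in the vocabulary, 5-dimensional embeddings.
embeds = nn.Embedding(len(word_to_ix), 5)

# Indices must be integer (long) tensors.
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)

print(hello_embed.shape)
print(hello_embed)
```

The result has one row per looked-up index, and each row is simply the corresponding row of the embedding matrix `embeds.weight`.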




In [None]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)


<torch._C.Generator at 0x7d1333fe4e90>

# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

where $w_i$ is the $i$-th word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.
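
To illustrate what the training examples for such a model look like, the sketch below pairs each word with its $n-1$ predecessors. The helper name `ngram_pairs` and the example sentence are assumptions for illustration.

```python
def ngram_pairs(tokens, n):
    """Pair each word w_i with its n-1 predecessors (w_{i-n+1}, ..., w_{i-1})."""
    pairs = []
    for i in range(n - 1, len(tokens)):
        context = tokens[i - n + 1:i]
        pairs.append((context, tokens[i]))
    return pairs


sentence = "the physicist solved the open problem".split()
for context, target in ngram_pairs(sentence, n=3):
    print(context, "->", target)
```

With `n=3` each target is predicted from the two words before it; unlike CBOW further down, nothing to the right of the target is used.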

# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
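
The objective can be sketched directly in PyTorch. The class name `SumCBOW`, the vocabulary size, and the indices below are assumptions for illustration. This variant sums the context embeddings exactly as the formula does, whereas the `CBOW` class used later in this notebook concatenates them and adds a hidden layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SumCBOW(nn.Module):
    """Hypothetical name: CBOW that sums context embeddings, as in the formula."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # rows are q_w
        self.linear = nn.Linear(embedding_dim, vocab_size)         # A and b

    def forward(self, context_idxs):
        # context_idxs: (batch, 2N) word indices.
        summed = self.embeddings(context_idxs).sum(dim=1)   # (batch, D)
        return F.log_softmax(self.linear(summed), dim=-1)   # (batch, |V|)


torch.manual_seed(1)
model = SumCBOW(vocab_size=10, embedding_dim=4)
context = torch.tensor([[1, 2, 4, 5]])   # N = 2 words on each side
target = torch.tensor([3])               # index of w_i
log_probs = model(context)
loss = F.nll_loss(log_probs, target)     # -log p(w_i | C)
print(log_probs.shape, loss.item())
```

Because the embeddings are summed, the order of the context words does not matter, which is where the "bag-of-words" in the name comes from.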


## Exercise Layout
### 1. <u>Training CBOW Embeddings</u>
1.1) Implement a CBOW Model by completing ```class CBOW(nn.Module)``` and train it on ```raw_text```.    

1.2) Load Datasets ```tripadvisor_hotel_reviews_reduced.csv``` and ```scifi_reduced.txt```.     

1.3) Decide preprocessing steps by completing the function ```def custom_preprocess()```. Describe your decisions. Note that it's your choice to create different preprocessing functions for hotel reviews and scifi datasets or use the same preprocessing function.             

1.4) Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.   

1.5) Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset. Are predictions made by the model sensitive towards the context size?
     
1.6) Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset.  


### 2. <u>Test your Embeddings</u>
Note - Do the following for CBOW2, and optionally for CBOW5

2.1) For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model. List them in your report and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   

2.2) Do the same for Sci-Fi dataset.   

2.3) How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.   

2.4) Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings. Do they have different neighbours? If yes, can you reason why?    

2.5) What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?   


### Tips

1. Switch from CPU to a GPU instance after you have confirmed that your training procedure is working correctly.
2. You can always save your intermediate results (embeddings, preprocessed dataset, model, etc.) to your Google Drive via Colab.



### 1.1 Create a CBOW Model by completing ```class CBOW(nn.Module)``` and test it on ```raw_text```
Implement CBOW in PyTorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.

### 1.2 Load Datasets

In [None]:
### Load Datasets tripadvisor_hotel_reviews_reduced.csv and scifi_reduced.txt

!gdown 1foE1JuZJeu5E_4qVge9kExzhvF32teuF # For Hotel Reviews
!gdown 13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75 # For Scifi-Text

Downloading...
From: https://drive.google.com/uc?id=1foE1JuZJeu5E_4qVge9kExzhvF32teuF
To: /content/tripadvisor_hotel_reviews_reduced.csv
100% 7.36M/7.36M [00:00<00:00, 45.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75
To: /content/scifi_reduced.txt
100% 43.1M/43.1M [00:00<00:00, 95.4MB/s]


### 1.3 Preprocess Datasets
### 🗒❓ Describe your decisions for preprocessing the datasets

In [4]:
import pandas as pd
df = pd.read_csv('tripadvisor_hotel_reviews_reduced.csv')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'tripadvisor_hotel_reviews_reduced.csv'

In [7]:
### Complete the preprocessing function and apply it to the datasets
import re
def custom_preprocess(df, col):
    def tokenize(text):
        # Lowercase the text and extract runs of word characters (letters,
        # digits, underscore). Only minimal preprocessing; more steps could
        # be added, but the dataset already seemed to be preprocessed.
        return re.findall(r'\b\w+\b', text.lower())

    df_preprocessed = df[col].apply(tokenize)

    vocab_set = set(token for tokens in df_preprocessed for token in tokens)
    return df_preprocessed, vocab_set
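
A quick way to see what this tokenization does is to run the same regular expression on a single string. The review text below is made up for illustration.

```python
import re


def tokenize(text):
    # Same pattern as in custom_preprocess: runs of word characters.
    return re.findall(r'\b\w+\b', text.lower())


# Hypothetical review text, for illustration only.
tokens = tokenize("Great, well-furnished room!")
print(tokens)  # ['great', 'well', 'furnished', 'room']
```

Hyphenated words are split at the hyphen, so a token such as `well-furnished` never enters the vocabulary, and looking it up in `word_to_ix` raises a `KeyError`.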



In [None]:
df_pre, vocab = custom_preprocess(df, 'Review')
df_pre.info()

<class 'pandas.core.series.Series'>
RangeIndex: 10000 entries, 0 to 9999
Series name: Review
Non-Null Count  Dtype 
--------------  ----- 
10000 non-null  object
dtypes: object(1)
memory usage: 78.2+ KB


In [8]:
def create_vocab_and_data(df, col, context_size=2):
    # custom_preprocess tokenizes every row and returns the deduplicated vocabulary (a set)
    df, vocab = custom_preprocess(df, col)
    word_to_ix = {word: i for i, word in enumerate(vocab)}
    data = []
    for j in range(len(df)):
        raw_text = df[j]
        # Visit every position that has a full window on both sides
        for i in range(context_size, len(raw_text) - context_size):
            context = raw_text[i - context_size:i] + raw_text[i + 1:i + context_size + 1]
            target = raw_text[i]
            data.append((context, target))
    return data, word_to_ix, vocab
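
The windowing logic can be checked on a single token list. The sketch below copies the inner loop into a hypothetical helper `context_pairs`; the six-token review is made up for illustration.

```python
def context_pairs(tokens, context_size):
    """Same windowing as create_vocab_and_data, for a single token list."""
    pairs = []
    for i in range(context_size, len(tokens) - context_size):
        context = tokens[i - context_size:i] + tokens[i + 1:i + context_size + 1]
        pairs.append((context, tokens[i]))
    return pairs


# Hypothetical six-token review.
tokens = ["fantastic", "service", "large", "hotel", "caters", "business"]
print(context_pairs(tokens, 2))
print(len(context_pairs(tokens, 2)))  # 6 - 2*2 = 2 pairs
print(len(context_pairs(tokens, 5)))  # too short for a width-5 window: 0 pairs
```

A text with `n` tokens yields `max(0, n - 2 * context_size)` examples, so the first and last `context_size` words are never targets and short texts contribute nothing. This is consistent with the recorded outputs in this notebook, where the width-5 training set (892950 rows) is smaller than the width-2 one (946944 rows).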


In [None]:
data, word_to_ix, vocab = create_vocab_and_data(df, 'Review')
print(data[0:5])



[(['fantastic', 'service', 'hotel', 'caters'], 'large'), (['service', 'large', 'caters', 'business'], 'hotel'), (['large', 'hotel', 'business', 'corporates'], 'caters'), (['hotel', 'caters', 'corporates', 'serve'], 'business'), (['caters', 'business', 'serve', 'provided'], 'corporates')]


In [None]:
print(int(len(data)*0.9))

946944


In [None]:
data, word_to_ix, vocab = create_vocab_and_data(df, 'Review', context_size=5)
print(data[0:5])


[(['fantastic', 'service', 'large', 'hotel', 'caters', 'corporates', 'serve', 'provided', 'better', 'wife'], 'business'), (['service', 'large', 'hotel', 'caters', 'business', 'serve', 'provided', 'better', 'wife', 'experienced'], 'corporates'), (['large', 'hotel', 'caters', 'business', 'corporates', 'provided', 'better', 'wife', 'experienced', 'nothing'], 'serve'), (['hotel', 'caters', 'business', 'corporates', 'serve', 'better', 'wife', 'experienced', 'nothing', 'short'], 'provided'), (['caters', 'business', 'corporates', 'serve', 'provided', 'wife', 'experienced', 'nothing', 'short', 'world'], 'better')]


In [9]:
def make_context_vector(context, word_to_ix):

    # Accept either a list of context words or a single word
    idxs = [word_to_ix[w] for w in context] if isinstance(context, list) else [word_to_ix[context]]

    return torch.tensor(idxs, dtype=torch.long)


print(make_context_vector(data[0][0], word_to_ix))  # example
print(make_context_vector(data[0][1], word_to_ix))  # example
data[0][1]


NameError: name 'data' is not defined

In [10]:
def data_to_tensor(data, word_to_ix):
    data_tensor_context = []
    data_tensor_target  = []
    for context, target in data:
        context_idxs = make_context_vector(context, word_to_ix)
        target_idx = word_to_ix[target]
        data_tensor_context.append(context_idxs)
        data_tensor_target.append(target_idx)
    data_tensor_context = torch.stack(data_tensor_context)
    data_tensor_target = torch.tensor(data_tensor_target, dtype=torch.long)
    return data_tensor_context, data_tensor_target.unsqueeze(1)


In [None]:
#data_to_tensor(data[:10], word_to_ix)

In [None]:
split_index= int(len(data)*0.9)

train_X, train_Y = data_to_tensor(data[:split_index], word_to_ix)
test_X, test_Y = data_to_tensor(data[split_index:], word_to_ix)
print(train_X.shape)
print(train_Y.shape)

torch.Size([946944, 4])
torch.Size([946944, 1])


In [None]:
print(test_X.shape)
print(test_Y.shape)

torch.Size([105216, 4])
torch.Size([105216, 1])


In [None]:
len(data)

1052160

In [11]:
from torch.utils.data import IterableDataset, DataLoader
class MyDataset(IterableDataset):
    def __init__(self, data_X, data_y):
        assert len(data_X) == len(data_y)
        self.data_X = data_X.to(device)
        self.data_y = data_y.to(device)

    def __len__(self):
        return len(self.data_X)

    def __iter__(self):
        for i in range(len(self.data_X)):
            yield (self.data_X[i], self.data_y[i])

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
train_set = MyDataset(train_X, train_Y)
test_set = MyDataset(test_X, test_Y)

cuda


In [12]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)


class CBOW(nn.Module):


    def __init__(self, vocab_size, embedding_dim, hidden_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * 2 * embedding_dim, hidden_dim)  # multiply by two because there is a left and a right context
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((inputs.shape[0], -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=-1)
        return log_probs

# create your model and train.  here are some functions to help you make
# the data ready for use by your module




In [13]:

from sklearn.metrics import accuracy_score


def evaluate(model, test_loader):
    """Return the accuracy of `model` on `test_loader`.

    Named `evaluate` rather than `eval` so that Python's built-in eval
    is not shadowed.
    """
    predictions = []
    true_labels = []
    model.eval()  # not strictly necessary without dropout and the like
    with torch.no_grad():  # gradients are only needed when backward() is called
        for test_X, test_y in test_loader:
            pred_y = model(test_X)
            predictions.extend(torch.argmax(pred_y, dim=-1).tolist())
            true_labels.extend(test_y.view(-1).tolist())
    return accuracy_score(true_labels, predictions)

In [None]:
print(model.embeddings)

Embedding(36872, 50)


In [14]:
def create_dataloader(df, col, BATCH_SIZE=16):
    """wrapper function to create all the necessary data"""
    data, word_to_ix, vocab = create_vocab_and_data(df, col, context_size=CONTEXT_SIZE)
    # data = data[:100]
    split_index= int(len(data)*0.9)
    train_X, train_Y = data_to_tensor(data[:split_index], word_to_ix)
    test_X, test_Y = data_to_tensor(data[split_index:], word_to_ix)
    print(train_X.shape)
    print(train_Y.shape)
    train_set = MyDataset(train_X, train_Y)
    test_set = MyDataset(test_X, test_Y)
    train_loader = DataLoader(train_set, batch_size=BATCH_SIZE)
    test_loader = DataLoader(test_set, batch_size=BATCH_SIZE)
    return train_loader, test_loader, vocab, word_to_ix

In [26]:
def trainig_loop(model, train_loader, optimizer, loss_func, num_epochs, device):
    """wrapper function for training for better reusability"""
    model.train()
    num_batches = len(train_loader)
    for epoch in range(1, num_epochs + 1):
      for batch_num, (inputs, y_true) in enumerate(train_loader, 1):
          optimizer.zero_grad()
          #print(inputs.shape, y_true.shape, len(vocab), y_true.squeeze().shape)
          y_pred = model(inputs)
          loss = loss_func(y_pred, y_true.squeeze())
          loss_batch = loss.item()
          loss.backward()
          optimizer.step()
      print(f'Epoch [{epoch}/{num_epochs}], loss: {loss_batch:.4f}')
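
One detail worth checking: the `CBOW` model returns log-probabilities (`F.log_softmax`), while the training cells pass `nn.CrossEntropyLoss` as `loss_func`, which applies a log-softmax of its own. The sketch below, on made-up logits and targets, suggests that this gives the same value as the more conventional pairing with `nn.NLLLoss`, because applying log-softmax to values that are already log-probabilities leaves them unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)
logits = torch.randn(4, 10)              # hypothetical batch: 4 examples, 10 classes
targets = torch.tensor([1, 0, 7, 3])
log_probs = F.log_softmax(logits, dim=-1)

ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, targets)  # what the loop computes
nll_on_log_probs = nn.NLLLoss()(log_probs, targets)          # the conventional pairing
ce_on_logits = nn.CrossEntropyLoss()(logits, targets)        # loss on raw scores

print(ce_on_log_probs.item(), nll_on_log_probs.item(), ce_on_logits.item())
```

So the training loop is numerically fine as written; using `nn.NLLLoss` with this model, or returning raw scores with `nn.CrossEntropyLoss`, would just avoid the redundant normalization.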



In [None]:
raise RuntimeError("Artificial breaking point: stop 'Run all' here")

### 1.4 Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.

In [None]:
CONTEXT_SIZE = 2
train_loader, test_loader, vocab, word_to_ix = create_dataloader(df, 'Review')

torch.Size([946944, 4])
torch.Size([946944, 1])


### 1.5 Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset.  

🗒❓ Are predictions made by the model sensitive towards the context size?

In [None]:
# Don't run this cell again

EMBEDDING_DIM = 50
HIDDEN_DIM = 128
model = CBOW(len(vocab), EMBEDDING_DIM, HIDDEN_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
loss_func = nn.CrossEntropyLoss()
num_epochs = 15
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

model.to(device)
trainig_loop(model, train_loader, optimizer, loss_func, num_epochs, device)
Modelname = "CBOW2new.pt"
PATH  =f"/content/drive/MyDrive/Colab Notebooks/{Modelname}"

# Save
torch.save(model, PATH)


NameError: name 'CBOW' is not defined

## No need to run the code above

In [None]:
# Load
from google.colab import drive
drive.mount('/content/drive')

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)


class CBOW(nn.Module):


    def __init__(self, vocab_size, embedding_dim, hidden_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * 2 * embedding_dim, hidden_dim)  # multiply by two because there is a left and a right context
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((inputs.shape[0], -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=-1)
        return log_probs

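# weights_only=False is required on recent PyTorch versions (2.6 and later)
# to unpickle a full model object rather than just a state dict
model = torch.load("/content/drive/MyDrive/Colab Notebooks/CBOW2.pt", weights_only=False)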
model.eval()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
model.to(device)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




cuda


CBOW(
  (embeddings): Embedding(36872, 50)
  (linear1): Linear(in_features=200, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=36872, bias=True)
)
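
The cells above save and load the whole model object with `torch.save(model, PATH)`, which is why the `CBOW` class has to be defined again before loading. A common alternative is to store only the weights together with the index mapping. The sketch below is an assumption about how that could look, not the procedure used in this notebook: it uses a stand-in `nn.Embedding` model, a temporary directory, and hypothetical file names.

```python
import json
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model and vocabulary (hypothetical); the same calls apply to CBOW.
word_to_ix = {"hotel": 0, "room": 1, "great": 2}
model = nn.Embedding(len(word_to_ix), 4)

save_dir = tempfile.mkdtemp()   # replace with a Drive folder in Colab
weights_path = os.path.join(save_dir, "model_state.pt")
vocab_path = os.path.join(save_dir, "word_to_ix.json")

torch.save(model.state_dict(), weights_path)   # tensors only, no pickled class
with open(vocab_path, "w", encoding="utf-8") as f:
    json.dump(word_to_ix, f)

# Reload: rebuild the architecture first, then restore the weights.
with open(vocab_path, encoding="utf-8") as f:
    restored_word_to_ix = json.load(f)
restored = nn.Embedding(len(restored_word_to_ix), 4)
restored.load_state_dict(torch.load(weights_path))
```

Storing `word_to_ix` next to the weights matters here because the mapping is built from a set, so the index assigned to each word can differ between sessions.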

In [None]:
import ast

# The files are assumed to hold Python literals (a set and a dict);
# ast.literal_eval parses literals without executing arbitrary code.
with open('/content/drive/MyDrive/Colab Notebooks/CBOW2.ptvocab.txt', 'r') as f:
    vocab = ast.literal_eval(f.read())
print(vocab)

with open('/content/drive/MyDrive/Colab Notebooks/CBOW2.ptword_to_ix.txt', 'r') as f:
    word_to_ix = ast.literal_eval(f.read())
print(word_to_ix)


In [None]:

print(word_to_ix['ls'])
print(len(vocab))

2
36872


In [23]:
import torch.nn as nn

def get_closest_word(word, topn=5):
    # Reversed dictionary to find tokens by their index in the vocab
    index_to_word = {value: key for key, value in word_to_ix.items()}
    word_distance = []
    model.eval()
    emb = model.embeddings
    pdist = nn.PairwiseDistance()  # Euclidean distance (p=2) by default
    i = word_to_ix[word]

    lookup_tensor_i = torch.tensor([i], dtype=torch.long).to(device)
    v_i = emb(lookup_tensor_i)
    # Start at index 0 so that no vocabulary entry is skipped
    for j in range(len(vocab)):
        if j != i:
            lookup_tensor_j = torch.tensor([j], dtype=torch.long).to(device)
            v_j = emb(lookup_tensor_j)
            word_distance.append((index_to_word[j], float(pdist(v_i, v_j))))
    word_distance.sort(key=lambda x: x[1])
    return word_distance[:topn]
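
Looping over the vocabulary one word at a time is slow for tens of thousands of entries. A possible vectorized variant is sketched below. The helper name `closest_words`, its explicit arguments, and the toy embedding table are assumptions for illustration; it computes plain Euclidean distances, which should match the loop above up to the small `eps` that `nn.PairwiseDistance` adds.

```python
import torch
import torch.nn as nn


def closest_words(word, embeddings, word_to_ix, topn=5):
    """Vectorized variant (hypothetical helper) of get_closest_word."""
    index_to_word = {ix: w for w, ix in word_to_ix.items()}
    weights = embeddings.weight.detach()                     # (|V|, D)
    i = word_to_ix[word]
    # Euclidean distance from row i to every row, in one operation.
    dists = torch.linalg.norm(weights - weights[i], dim=1)   # (|V|,)
    dists[i] = float("inf")                                  # exclude the query word
    k = min(topn, len(word_to_ix) - 1)
    values, indices = torch.topk(dists, k, largest=False)
    return [(index_to_word[j], d) for j, d in zip(indices.tolist(), values.tolist())]


# Toy embedding table with hand-set rows so the neighbours are easy to check.
toy_word_to_ix = {"hotel": 0, "resort": 1, "room": 2, "alien": 3}
toy = nn.Embedding(4, 2)
with torch.no_grad():
    toy.weight.copy_(torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]))

neighbours = closest_words("hotel", toy, toy_word_to_ix, topn=2)
print(neighbours)
```

With the trained model this would be called as `closest_words('hotel', model.embeddings, word_to_ix)`.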

In [None]:
get_closest_word('hotel')

[('resort', 3.125905752182007),
 ('rooms', 3.7057602405548096),
 ('great', 3.832054376602173),
 ('excellent', 3.9706552028656006),
 ('room', 4.023690700531006)]



In [None]:
print(get_closest_word('room'))
print(get_closest_word('atlantic'))
print(get_closest_word('beautiful'))
print(get_closest_word('great'))
print(get_closest_word('did'))
print(get_closest_word('stay'))



[('rooms', 2.089414358139038), ('hotel', 4.023687362670898), ('bathroom', 4.101468086242676), ('suite', 4.250602722167969), ('floor', 4.554897785186768)]
[('nb', 5.9741644859313965), ('place', 6.12546443939209), ('notices', 6.195394992828369), ('carme', 6.19688606262207), ('days', 6.261256217956543)]
[('nice', 3.7269740104675293), ('amazing', 3.8383333683013916), ('clean', 3.928826332092285), ('great', 4.006716251373291), ('lovely', 4.092239856719971)]
[('good', 2.28143310546875), ('excellent', 2.3379077911376953), ('wonderful', 2.5288219451904297), ('nice', 2.9609107971191406), ('lovely', 3.273216485977173)]
[('does', 3.630962371826172), ('went', 4.198888301849365), ('not', 4.366249084472656), ('got', 4.486250877380371), ('think', 4.57169246673584)]
[('visit', 4.014122486114502), ('think', 4.113770484924316), ('hotel', 4.233322620391846), ('staying', 4.359683990478516), ('experience', 4.486968994140625)]


KeyError: 'well-furnished'

In [None]:
print(get_closest_word('ineffective'))
print(get_closest_word('copley'))
print(get_closest_word('cracked'))

[('coordinators', 7.205868721008301), ('cynicism', 7.256261348724365), ('sophisticated', 7.271458625793457), ('deters', 7.288957595825195), ('911', 7.342331409454346)]
[('clean', 6.673245906829834), ('diva', 6.845100402832031), ('franglish', 6.883596897125244), ('handwritten', 6.986283779144287), ('recommend', 7.025819778442383)]
[('trian', 6.5322346687316895), ('spacous', 6.660488128662109), ('faux', 6.74277400970459), ('crossisants', 6.7488861083984375), ('knocking', 6.758415222167969)]


In [None]:
words = ['hotel', 'room', 'copley', 'did', 'stay', 'trian', 'beautiful', 'ineffective', 'cracked']

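To figure out which words are more frequent and which are rarer, we could use the index numbers in the vocab. I think the higher the index number, the rarer the word, as words only get added if they are not already in the vocab? But I am not sure, as we create a set from it, and sets are not ordered. Since `word_to_ix` is built by enumerating that set, its indices do not reflect frequency; counting the tokens directly is more reliable.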
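
Token frequencies can be counted directly with `collections.Counter`. The sketch below uses a few made-up tokenized reviews in place of the preprocessed `Review` column.

```python
from collections import Counter

# Hypothetical tokenized reviews, standing in for the preprocessed Series.
reviews = [
    ["great", "hotel", "great", "room"],
    ["hotel", "room", "cracked", "window"],
    ["great", "stay"],
]

counts = Counter(token for review in reviews for token in review)
frequent = counts.most_common(2)                   # most frequent tokens
rare = [w for w, c in counts.items() if c == 1]    # tokens seen only once

print(frequent)
print(rare)
```

On the real data, the same counter built over `df_pre` would give a principled way to pick both frequent and rare nouns, verbs, and adjectives for question 2.1.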


In [None]:
CONTEXT_SIZE = 5
# train_loader, test_loader, vocab, word_to_ix = create_dataloader(df, 'Review')
# Modelname = "CBOW5.pt"
# PATH  =f"/content/drive/MyDrive/Colab Notebooks/{Modelname}"


import ast

# The files are assumed to hold Python literals (a set and a dict);
# ast.literal_eval parses literals without executing arbitrary code.
with open('/content/drive/MyDrive/Colab Notebooks/CBOW5.ptvocab.txt', 'r') as f:
    vocab = ast.literal_eval(f.read())
print(vocab)

with open('/content/drive/MyDrive/Colab Notebooks/CBOW5.ptword_to_ix.txt', 'r') as f:
    word_to_ix = ast.literal_eval(f.read())
print(word_to_ix)


torch.Size([892950, 10])
torch.Size([892950, 1])


In [None]:
EMBEDDING_DIM = 50
HIDDEN_DIM = 128
model = CBOW(len(vocab), EMBEDDING_DIM, HIDDEN_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
loss_func = nn.CrossEntropyLoss()
num_epochs = 15
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

model.to(device)
trainig_loop(model, train_loader, optimizer, loss_func, num_epochs, device)

Modelname = "CBOW5new.pt"
PATH  =f"/content/drive/MyDrive/Colab Notebooks/{Modelname}"

torch.save(model, PATH)



Streaming output truncated to the last 5000 lines.
Epoch [5/15], batch: [51417/55810, loss: 5.1319]
Epoch [5/15], batch: [51418/55810, loss: 7.0256]
Epoch [5/15], batch: [51419/55810, loss: 6.4952]
Epoch [5/15], batch: [51420/55810, loss: 6.2850]
Epoch [5/15], batch: [51421/55810, loss: 4.8743]
Epoch [5/15], batch: [51422/55810, loss: 5.6820]
Epoch [5/15], batch: [51423/55810, loss: 4.5090]
Epoch [5/15], batch: [51424/55810, loss: 5.2988]
Epoch [5/15], batch: [51425/55810, loss: 6.0163]
Epoch [5/15], batch: [51426/55810, loss: 6.9619]
Epoch [5/15], batch: [51427/55810, loss: 6.5024]
Epoch [5/15], batch: [51428/55810, loss: 5.4902]
Epoch [5/15], batch: [51429/55810, loss: 6.0992]
Epoch [5/15], batch: [51430/55810, loss: 7.4890]
Epoch [5/15], batch: [51431/55810, loss: 7.3463]
Epoch [5/15], batch: [51432/55810, loss: 6.4128]
Epoch [5/15], batch: [51433/55810, loss: 7.0579]
Epoch [5/15], batch: [51434/55810, loss: 7.1029]
Epoch [5/15], batch: [51435/55810, loss: 6.1884]
Epoc

In [None]:
import ast

# The files are assumed to hold Python literals (a set and a dict);
# ast.literal_eval parses literals without executing arbitrary code.
with open('/content/drive/MyDrive/Colab Notebooks/CBOW5.ptvocab.txt', 'r') as f:
    vocab = ast.literal_eval(f.read())
print(vocab)

with open('/content/drive/MyDrive/Colab Notebooks/CBOW5.ptword_to_ix.txt', 'r') as f:
    word_to_ix = ast.literal_eval(f.read())
print(word_to_ix)
get_closest_word('hotel')


[('wak', 6.224923610687256),
 ('masterpieces', 6.45383358001709),
 ('muy', 6.520915985107422),
 ('agno', 6.52716588973999),
 ('reaked', 6.556485652923584)]

In [None]:
words = ['hotel', 'room', 'copley', 'did', 'stay', 'trian', 'beautiful', 'ineffective', 'cracked']
for word in words:
  print(get_closest_word(word))

### 2.1 For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. (CBOW2 and optionally for CBOW5)
Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model.    

🗒❓ List them in your report (at the end of this notebook) and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   


### 1.6 Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset

In [1]:
!gdown 13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75 # For Scifi-Text

with open('scifi_reduced.txt', 'r') as file:
    text = file.read().split(".")

print(text[:10])


Downloading...
From: https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75
To: /content/scifi_reduced.txt
100% 43.1M/43.1M [00:00<00:00, 56.8MB/s]
[' A chat with the editor  i #  science fiction magazine called IF', ' The title was selected after much thought because of its brevity and on the theory it is indicative of the field and will be easy to remember', " The tentative title that just morning and couldn't remember it until we'd had a cup of coffee, it was summarily discarded", ' A great deal of thought and effort lias gone into the formation of this magazine', ' We have had the aid of several very talented and generous people, for which we are most grateful', ' Much is due them for their warmhearted assistance', ' And now that the bulk of the formative work is done, we will try to maintain IF as one of the finest books on the market', '  t a great public demand for our magazine', ' In short, why will you buy IF? We cannot, in honesty, say we will publish at all times t

In [5]:
import pandas as pd

sci_df = pd.DataFrame(text, columns=['Sentence'])
sci_df.head()

Unnamed: 0,Sentence
0,A chat with the editor i # science fiction ...
1,The title was selected after much thought bec...
2,The tentative title that just morning and cou...
3,A great deal of thought and effort lias gone ...
4,We have had the aid of several very talented ...


In [17]:
CONTEXT_SIZE = 2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
train_loader, test_loader, vocab, word_to_ix = create_dataloader(sci_df, 'Sentence')

cuda
torch.Size([5062096, 4])
torch.Size([5062096, 1])


In [21]:
import ast

# The files are assumed to hold Python literals (a set and a dict);
# ast.literal_eval parses literals without executing arbitrary code.
with open('/content/drive/MyDrive/Colab Notebooks/CBOW2_scifi.ptvocab.txt', 'r') as f:
    vocab = ast.literal_eval(f.read())
print(vocab)

with open('/content/drive/MyDrive/Colab Notebooks/CBOW2_scifi.ptword_to_ix.txt', 'r') as f:
    word_to_ix = ast.literal_eval(f.read())
print(word_to_ix)



In [25]:
print(word_to_ix['alien'])
print(len(vocab))

9438
111577


In [None]:
EMBEDDING_DIM = 50
HIDDEN_DIM = 128
model = CBOW(len(vocab), EMBEDDING_DIM, HIDDEN_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
loss_func = nn.CrossEntropyLoss()
num_epochs = 3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

model.to(device)
trainig_loop(model, train_loader, optimizer, loss_func, num_epochs, device)

Modelname = "CBOW2_scifi.pt"
PATH  =f"/content/drive/MyDrive/Colab Notebooks/{Modelname}"

# Save
torch.save(model, PATH)

cuda


In [24]:
get_closest_word('alien')

NameError: name 'model' is not defined

In [None]:
sci_words = ['']

### 2.2 Repeat 2.1 for SciFi Dataset

🗒❓ List your findings for SciFi Dataset as well, similarly to 2.1

### 2.3 🗒❓ How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.

### 2.4 Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings.

🗒❓ Do they have different neighbours? If yes, can you reason why?

### 2.5 🗒❓ What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?    

### Report
The lab report should contain a detailed description of the approaches you have used to solve this exercise. Please also include results.

Answers to the questions marked 🗒❓ go here as well.