# Assignment 1

<b>Group [fill in group number]</b>
* <b> Student 1 </b> : Tommaso Bonomo (1511831)
* <b> Student 2 </b> : Mattia Molon (1511866)

**Reading material**
* [1] Mikolov, Tomas, et al. "[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)" arXiv preprint arXiv:1301.3781. 2013.

<b><font color='red'>NOTE</font></b> When submitting your notebook, please make sure that the training history of your model is visible in the output. This means that you should **NOT** clean your output cells of the notebook. Make sure that your notebook runs without errors in linear order.



# Question 1 - Keras implementation (10 pt)

### Word embeddings
Build word embeddings with a Keras implementation where the embedding vector is of length 50, 150 and 300. Use the text of Alice in Wonderland for training. Use a window size of 2 to train the embeddings (`window_size` in the Jupyter notebook).

1. Build word embeddings of length 50, 150 and 300 using the Skipgram model
2. Build word embeddings of length 50, 150 and 300 using the CBOW model
3. Analyze the different word embeddings:
    - Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in the paper. Do not use existing libraries for this task such as Gensim. 
Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. $e_{x}$ denotes the embedding of word $x$. We want to find the word $p$ in the vocabulary, where the embedding of $p$ ($e_p$) is the closest to the predicted embedding (i.e. result of the formula). Then, we can check if $p$ is the same word as the true word $t$.
    - Give at least 5 different examples of analogies.
    - Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.

4. Discuss:
  - Given the same number of sentences as input, CBOW and Skipgram arrange the data into different numbers of training samples. Which one has more, and why?


<b>HINT</b> See practical 3.1 for some helpful code to start this assignment.


### Import libraries

In [0]:
%tensorflow_version 2.x

In [0]:
import numpy as np
from tensorflow.keras import backend as K  # use the backend bundled with TF rather than standalone Keras
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Reshape, Layer, Flatten
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

# other helpful libraries
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as nn
from matplotlib import pylab
import pandas as pd

# typing
from typing import List, Union, Tuple, Dict

In [40]:
print(tf.__version__) #  check what version of TF is imported

2.2.0


### Import file

If you use Google Colab, you need to mount your Google Drive to the notebook when you want to use files that are located in your Google Drive. Paste the authorization code, from the new tab page that opens automatically when running the cell, in the cell below.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Navigate to the folder in which `alice.txt` is located. Make sure to start the path with '/content/drive/My Drive/' if you want to load the file from your Google Drive.

In [0]:
# TOMMASO
# %cd /content/drive/My Drive/deep learning (Tommaso)/                           

# MATTIA
%cd drive/My\ Drive/Università/Deep\ Learning/Assignment\ 1

In [0]:
file_name = 'alice.txt'
corpus = open(file_name).readlines()

### Data preprocessing

See Practical 3.1 for an explanation of the preprocessing steps done below.

In [0]:
# Removes sentences with fewer than 3 words
corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]

# remove punctuation in text and fit tokenizer on entire corpus
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'+"'")
tokenizer.fit_on_texts(corpus)

# convert text to sequence of integer values
corpus = tokenizer.texts_to_sequences(corpus)
n_samples = sum(len(s) for s in corpus) # total number of words in the corpus
V = len(tokenizer.word_index) + 1 # vocabulary size: unique words + 1 for the reserved index 0

In [45]:
n_samples, V

(27165, 2557)

In [46]:
# example of what the word-to-integer mapping looks like in the tokenizer
print(list((tokenizer.word_index.items()))[:5])

[('the', 1), ('and', 2), ('to', 3), ('a', 4), ('it', 5)]
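
The mapping above reflects how the Keras `Tokenizer` assigns indices: words are ranked by descending frequency, starting at 1, and index 0 is reserved (which is why `V` adds 1). Below is a minimal pure-Python sketch of that ranking on a made-up toy corpus. `build_word_index` is a hypothetical helper, not part of Keras, and its tie-breaking may differ from the library's.

```python
from collections import Counter
from typing import Dict, List

def build_word_index(sentences: List[str]) -> Dict[str, int]:
    """Rank words by descending frequency; indices start at 1 (0 is left free for padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_corpus = ["the cat saw the dog", "the dog ran"]  # made-up sentences
word_index = build_word_index(toy_corpus)
vocab_size = len(word_index) + 1  # +1 for the reserved index 0

print(word_index)
print(vocab_size)
```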


In [0]:
# parameters
window_size = 2
window_size_corpus = 4

## Task 1.1 - Skipgram
Build word embeddings of length 50, 150 and 300 using the Skipgram model.

In [0]:
#prepare data for skipgram
def generate_data_skipgram(
    corpus: List[List[int]],
    window_size: int,
    seed: int = 4322
) -> Tuple[np.ndarray, np.ndarray]:
    rand_generator = np.random.default_rng(seed=seed)
    
    x: List[int] = []
    y: List[int] = []
    for line in corpus:    
        for idx, word in enumerate(line):
            word_range = rand_generator.integers(1, window_size, endpoint=True)
            
            start = max(0, idx - word_range) # clamp: a negative start would count from the end of the line
            surrounding_words = line[start : idx] + line[idx + 1 : idx + word_range + 1]
            x += [word] * len(surrounding_words)
            y += surrounding_words
        
    
    np_x = np.array(x)
    np_y = np.array(y)
    
    return np_x, np_y

In [0]:
# create training data
x, y = generate_data_skipgram(corpus, window_size)
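
One detail of window slicing is worth checking near the start of a line: a negative start index in a Python slice counts from the end of the list, so the left edge of the window has to be clamped at 0. Below is a small sketch on a toy sequence, using a fixed window rather than the randomly drawn one. `context_window` is a hypothetical helper for illustration only.

```python
from typing import List

def context_window(line: List[int], idx: int, size: int) -> List[int]:
    """Return up to `size` words on each side of position `idx`, excluding the word itself."""
    left = line[max(0, idx - size) : idx]   # clamp so the start never goes negative
    right = line[idx + 1 : idx + size + 1]  # slicing past the end is safe in Python
    return left + right

line = [10, 20, 30, 40, 50]  # toy word indices

unclamped = line[1 - 2 : 1]           # line[-1:1] is empty: the left neighbour is lost
clamped = context_window(line, 1, 2)  # keeps the left neighbour 10

print(unclamped, clamped)
```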

In [0]:
# create skipgram architecture
emb_size = 150
skipgram = Sequential([
    Embedding(V, emb_size, input_length=1),
    Flatten(),
    Dense(V, activation="softmax", use_bias=False, kernel_regularizer="l2")
])

skipgram.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

skipgram.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1, 150)            383550    
_________________________________________________________________
flatten_2 (Flatten)          (None, 150)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2557)              383550    
Total params: 767,100
Trainable params: 767,100
Non-trainable params: 0
_________________________________________________________________


<b>HINT</b>: To speed up training, you can use the freely available GPU in Google Colab. Go to `Edit` --> `Notebook Settings` --> select `GPU` under `hardware accelerator`.

In [0]:
# train skipgram model
_ = skipgram.fit(x, y, batch_size=32, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [0]:
import os
import pickle as pkl
from os.path import join

# save embeddings of the model
def save_embeddings(
    emb_matrix: np.ndarray,
    file_name: str,
    directory: str = './Embeddings',
    tokenizer: Tokenizer = tokenizer
    ) -> Dict[str, np.ndarray]:
  """Maps each word to its row of emb_matrix and pickles the resulting dictionary"""

  emb_dict = {}
  for word, i in tokenizer.word_index.items():
    emb_dict[word] = emb_matrix[i,:]

  os.makedirs(directory, exist_ok=True)                                         # create the folder if it is missing
  file_name += '.pickle'
  path = join(directory, file_name)
  with open(path, 'wb') as file:
    pkl.dump(emb_dict, file, protocol=pkl.HIGHEST_PROTOCOL)

  return emb_dict

# read embeddings
def load_embeddings(
    file_name: str,
    directory: str = './Embeddings'
    ) -> Dict[str, np.ndarray]:

  file_name += '.pickle'
  path = join(directory, file_name)
  emb_dict = {}
  with open(path, 'rb') as file:
    emb_dict = pkl.load(file)

  return emb_dict

In [0]:
emb_matrix = skipgram.get_weights()[0]
_ = save_embeddings(emb_matrix, 'skp_150')
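
As a quick sanity check of the save/load pattern above, the sketch below round-trips a small dictionary through `pickle` in a temporary directory. The words and vectors are made up, and plain lists stand in for NumPy rows. The directory is created explicitly because `open` does not create missing folders.

```python
import os
import pickle as pkl
import tempfile
from os.path import join

toy_embeddings = {"alice": [0.1, 0.2], "rabbit": [0.3, 0.4]}  # made-up values

with tempfile.TemporaryDirectory() as tmp_dir:
    directory = join(tmp_dir, "Embeddings")
    os.makedirs(directory, exist_ok=True)  # open() fails if the folder is missing

    path = join(directory, "toy.pickle")
    with open(path, "wb") as file:
        pkl.dump(toy_embeddings, file, protocol=pkl.HIGHEST_PROTOCOL)

    with open(path, "rb") as file:
        restored = pkl.load(file)

print(restored == toy_embeddings)
```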

## Task 1.2 - CBOW

Build word embeddings of length 50, 150 and 300 using CBOW model.

In [0]:
# prepare data for CBOW
def generate_data_cbow(
    corpus: List[List[int]],
    window_size: int
) -> Tuple[np.ndarray, np.ndarray]:
    raw_x = []
    raw_y = []
    for line in corpus:
        for word_idx, word in enumerate(line):
            min_idx = max(0, word_idx - window_size)
            max_idx = min(len(line), word_idx + window_size)
            raw_x.append(
                line[min_idx : word_idx] + line[word_idx + 1 : max_idx + 1]
            )
            raw_y.append(word)
    x = sequence.pad_sequences(raw_x) # Pads all sequences to a fixed length
    y = np.array(raw_y)
    return x, y
    
# create training data
cbow_x, cbow_y = generate_data_cbow(corpus, window_size)

# create CBOW architecture
class AverageLayer(Layer):
    """Custom layer to average out the result of an Embedding layer"""

    def __init__(self):
        super(AverageLayer, self).__init__()
    
    def call(self, inputs):
        return tf.math.reduce_mean(inputs, axis=1)

emb_size = 150 # the task asks for sizes 50, 150 and 300
cbow = Sequential([
    Embedding(V, emb_size, mask_zero=True, input_length=window_size_corpus),
    AverageLayer(),
    Dense(V, activation="softmax", use_bias=False, kernel_regularizer="l2")
])

cbow.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

# train CBOW model
_ = cbow.fit(cbow_x, cbow_y, batch_size=32, epochs=3)
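
Note that `mask_zero=True` only produces a mask. A plain mean over the sequence axis still divides by the full window length, so padded positions take part in the average. Below is a minimal NumPy sketch of the difference between a plain and a mask-aware mean, on made-up numbers. The padding rows are set to zeros here for clarity, whereas a trained `Embedding` row for index 0 is generally non-zero. `masked_mean` is a hypothetical helper, not the layer used above.

```python
import numpy as np

def masked_mean(embedded: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Average embeddings over axis 1, ignoring positions whose token id is 0 (padding)."""
    mask = (token_ids != 0).astype(embedded.dtype)[..., np.newaxis]  # (batch, window, 1)
    counts = np.maximum(mask.sum(axis=1), 1.0)                       # avoid division by zero
    return (embedded * mask).sum(axis=1) / counts

# one sample, 4 context slots, embedding size 2; the first two slots are padding
token_ids = np.array([[0, 0, 7, 9]])
embedded = np.array([[[0.0, 0.0], [0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]])

plain = embedded.mean(axis=1)              # divides by 4 (all slots)
masked = masked_mean(embedded, token_ids)  # divides by 2 (real words only)

print(plain, masked)
```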

## Task 1.3 - Analogy function

Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in [1]. Do not use existing libraries for this task such as Gensim. Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. 

In a perfect scenario, we would like this expression ($e_{king} - e_{queen} + e_{woman}$) to result in the embedding of the word "man". However, it does not always result in exactly the same word embedding. The result of the formula is called the expected or the predicted word embedding. In this context, "man" is called the true or the actual word $t$. We want to find the word $p$ in the vocabulary whose embedding ($e_p$) is the closest to the predicted embedding (i.e. the result of the formula). Then, we can check if $p$ is the same word as the true word $t$.

You have to answer each analogy using each embedding for both the CBOW and the Skipgram model. This means that for each analogy we have 6 outputs. Show the true word (with the similarity between the predicted embedding and the true word embedding, i.e. `sim1`), the predicted word (with the similarity between the predicted embedding and the embedding of the vocabulary word closest to it, i.e. `sim2`) and a boolean indicating whether the predicted word **exactly** equals the true word.

<b>HINT</b>: to visualize the results of the analogy tasks, you can print them in a table. An example is given below.


| Analogy task | True word (sim1)  | Predicted word (sim2) | Embedding | Correct?|
|------|------|------|------|------|
| queen is to king as woman is to ? | man (sim1) | predicted_word (sim2) | SG_50 | True / False |

* Give at least 5 different examples of analogies.
* Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.
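
The procedure can be sketched on a toy vocabulary. The 2-D vectors below are hand-made so that the analogy holds exactly. They are not trained embeddings, and `cosine` and `closest_word` are hypothetical helpers. The predicted embedding is compared with every word in the vocabulary using cosine similarity.

```python
import numpy as np
from typing import Dict, Tuple

def cosine(v: np.ndarray, w: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

def closest_word(pred: np.ndarray, emb: Dict[str, np.ndarray]) -> Tuple[str, float]:
    """Return the vocabulary word whose embedding has the highest cosine similarity to `pred`."""
    return max(((w, cosine(pred, e)) for w, e in emb.items()), key=lambda pair: pair[1])

# hand-made embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
emb = {
    "king":  np.array([3.0, 1.0]),
    "queen": np.array([3.0, -1.0]),
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, -1.0]),
}

pred = emb["king"] - emb["queen"] + emb["woman"]  # predicted embedding, here [1, 1]
true_word = "man"
sim1 = cosine(pred, emb[true_word])               # similarity to the true word
pred_word, sim2 = closest_word(pred, emb)         # closest word in the vocabulary

print(pred_word, round(sim1, 3), round(sim2, 3), pred_word == true_word)
```

With trained embeddings the match is rarely exact, which is why both `sim1` and `sim2` are reported.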

In [0]:
def cosine_similarity(
    v: Union[np.ndarray, tf.Tensor],
    w: Union[np.ndarray, tf.Tensor]
    ) -> float:
  """v and w should be 1D-vectors, for which to calculate the cosine similarity"""
  
  assert isinstance(v, tf.Tensor) or isinstance(v, np.ndarray), "v must be tf.Tensor or np.ndarray"
  assert isinstance(w, tf.Tensor) or isinstance(w, np.ndarray), "w must be tf.Tensor or np.ndarray"
  
  v = v.numpy() if isinstance(v, tf.Tensor) else v
  w = w.numpy() if isinstance(w, tf.Tensor) else w
  
  if v.ndim > 1:
      v = v.flatten()
  if w.ndim > 1:
      w = w.flatten()

  return v @ w.T / (np.linalg.norm(v) * np.linalg.norm(w))

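def score_analogy(
    analogy: List[str],
    answer: Union[str, List[str]],
    file_names: List[str]
    ) -> pd.DataFrame:
  """Answers 'analogy[0] is to analogy[1] as analogy[2] is to ?' with every saved embedding"""

  true_word = answer if isinstance(answer, str) else answer[0]
  question = '{} is to {} as {} is to ?'.format(analogy[0], analogy[1], analogy[2])
  dicts = [load_embeddings(f) for f in file_names]                              # dictionaries of embeddings
  preds = [d[analogy[1]] - d[analogy[0]] + d[analogy[2]] for d in dicts]        # predicted embeddings
  rows = []                                                                     # one result row per embedding

  for p, d, f in zip(preds, dicts, file_names):
    closest_sim = -10
    closest_word = ''
    for key in d.keys():                                                        # closest word to the prediction
      tmp_sim = cosine_similarity(p, d[key])
      if tmp_sim > closest_sim:
        closest_word = key
        closest_sim = tmp_sim
    sim1 = cosine_similarity(p, d[true_word])                                   # similarity to the true word
    rows.append({
        'Analogy task': question,
        'True word (sim1)': '{} ({:.3f})'.format(true_word, sim1),
        'Predicted word (sim2)': '{} ({:.3f})'.format(closest_word, closest_sim),
        'Embedding': f,
        'Correct?': closest_word == true_word
    })

  return pd.DataFrame(rows)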

In [0]:
analogy = ['queen', 'king', 'woman']
answer = ['man']
file_names = ['skp_50', 'skp_150', 'skp_300']

In [49]:
x = [1,2,3,4,5]
y = [1,1,3,4,4]
x == y

False

## Task 1.4 - Discussion
Answer the following question:
* Given the same number of sentences as input, CBOW and Skipgram arrange the data into different numbers of training samples. Which one has more, and why?
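
One way to reason about this is to count samples on a toy corpus. The sketch below assumes a fixed window (not the randomly shrunk window used in `generate_data_skipgram`) and the sample definitions used in this notebook: Skipgram emits one (center, context) pair per context word, while CBOW emits one (context, center) sample per center word. The helper names are hypothetical.

```python
from typing import List

def count_skipgram_samples(corpus: List[List[int]], window_size: int) -> int:
    """One training sample per (center word, context word) pair."""
    total = 0
    for line in corpus:
        for idx in range(len(line)):
            left = idx - max(0, idx - window_size)                # context words to the left
            right = min(len(line) - 1, idx + window_size) - idx  # context words to the right
            total += left + right
    return total

def count_cbow_samples(corpus: List[List[int]]) -> int:
    """One training sample per center word (the whole context is a single input)."""
    return sum(len(line) for line in corpus)

toy_corpus = [[1, 2, 3, 4, 5], [6, 7, 8]]  # made-up word indices
n_skipgram = count_skipgram_samples(toy_corpus, window_size=2)
n_cbow = count_cbow_samples(toy_corpus)

print(n_skipgram, n_cbow)
```

Under these assumptions Skipgram yields more samples whenever words have more than one context word on average, approaching `2 * window_size` times as many for long sentences.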


# Question 2 - Peer review (0 pt):
Finally, each group member must write a single paragraph outlining their opinion on the work distribution within the group. Did every group member contribute equally? Did you split up tasks in a fair manner, or did you work through the exercises jointly? Do you think that some members of your group deserve a different grade from others? You can use the table below to give an overview of how the tasks were divided:



| Student name | Task  |
|------|------|
|  student name 1  | task x |
| student name 2  | task x|
| everyone | task x|
