# Assignment 1

<b>Group [96]</b>
* <b> Student 1 </b> : YU-WEN HUANG + 1513753

**Reading material**
* [1] Mikolov, Tomas, et al. "[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)" Advances in neural information processing systems. 2013. 

<b><font color='red'>NOTE</font></b> When submitting your notebook, please make sure that the training history of your model is visible in the output. This means that you should **NOT** clean your output cells of the notebook. Make sure that your notebook runs without errors in linear order.



# Question 1 - Keras implementation (10 pt)

### Word embeddings
Build word embeddings with a Keras implementation where the embedding vector is of length 50, 150 and 300. Use the Alice in Wonderland text book for training. Use a window size of 2 to train the embeddings (`window_size` in the jupyter notebook). 

1. Build word embeddings of length 50, 150 and 300 using the Skipgram model
2. Build word embeddings of length 50, 150 and 300 using CBOW model
3. Analyze the different word embeddings:
    - Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in the paper. Do not use existing libraries for this task such as Gensim. 
Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. $e_{x}$ denotes the embedding of word $x$. We want to find the word $p$ in the vocabulary, where the embedding of $p$ ($e_p$) is the closest to the predicted embedding (i.e. result of the formula). Then, we can check if $p$ is the same word as the true word $t$.
    - Give at least 5 different  examples of analogies.
    - Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.

4. Discuss:
  - Given the same number of sentences as input, CBOW and Skipgram arrange the data into different number of training samples. Which one has more and why?


<b>HINT</b> See practical 3.1 for some helpful code to start this assignment.


### Import libraries

In [0]:
%tensorflow_version 2.x

In [5]:
import numpy as np
import keras.backend as K
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Reshape, Lambda, Flatten
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.optimizers import Nadam, Adadelta


# other helpful libraries
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as nn
from matplotlib import pylab
import pandas as pd

Using TensorFlow backend.


In [6]:
print(tf.__version__) #  check what version of TF is imported

2.2.0


### Import file

If you use Google Colab, you need to mount your Google Drive to the notebook when you want to use files that are located in your Google Drive. Paste the authorization code, from the new tab page that opens automatically when running the cell, in the cell below.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Navigate to the folder in which `alice.txt` is located. Make sure to start path with '/content/drive/My Drive/' if you want to load the file from your Google Drive.

In [8]:
cd '/content/drive/My Drive/Colab Notebooks/DL course/'

/content/drive/My Drive/Colab Notebooks/DL course


In [0]:
file_name = 'alice.txt'
corpus = open(file_name).readlines()

### Data preprocessing

See Practical 3.1 for an explanation of the preprocessing steps done below.

In [0]:
# Removes sentences with fewer than 3 words
corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]

# remove punctuation in text and fit tokenizer on entire corpus
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'+"'")
tokenizer.fit_on_texts(corpus)

# convert text to sequence of integer values
corpus = tokenizer.texts_to_sequences(corpus)
n_samples = sum(len(s) for s in corpus) # total number of words in the corpus
V = len(tokenizer.word_index) + 1 # total number of unique words in the corpus

In [8]:
n_samples, V

(27165, 2557)

In [9]:
# example of how word to integer mapping looks like in the tokenizer
print(list((tokenizer.word_index.items()))[:5])

[('the', 1), ('and', 2), ('to', 3), ('a', 4), ('it', 5)]


In [0]:
# parameters
window_size = 2
window_size_corpus = 4



## Task 1.1 - Skipgram
Build word embeddings of length 50, 150 and 300 using the Skipgram model.

In [14]:
#prepare data for skipgram
# Hyper parameters for the models
epochs = 10
batch_size = 64
dims = [50, 150, 300]
# def generate_data_skipgram(corpus, window_size, V):
    # TODO Implement here
    # HINT: see Practical 3.1

def generate_skipgram(corpus, window_size, V):
    maxlen = window_size*2
    all_in = []
    all_out = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            p = index - window_size
            n = index + window_size + 1
            for i in range(p, n):
                if i != index and 0 <= i < L:
                    all_in.append(word)
                    all_out.append(to_categorical(words[i], V))
                                      
    return (np.array(all_in),np.array(all_out))

x_skipgram , y_skipgram = generate_skipgram(corpus,window_size,V)
x_skipgram.shape, y_skipgram.shape

((94556,), (94556, 2557))

In [0]:
# create training data
x , y = generate_skipgram(corpus,window_size,V)

In [16]:
# create skipgram architecture
skipgram_models = {}
# save embeddings for vectors of length 50, 150 and 300 using skipgram model
dims = [50, 150, 300]
for dim in dims:
    # skipgram = Sequential([
    #         Embedding(input_dim=V, output_dim=dim, input_length=1, embeddings_initializer="glorot_uniform", name='embedding'),Flatten(),
    #         Dense(V, activation="softmax", kernel_initializer="glorot_uniform")
    # ])

    skipgram = Sequential()
    skipgram.add(Embedding(input_dim=V, output_dim=dim, embeddings_initializer='glorot_uniform', input_length=1,name='embedding'))
    skipgram.add(Reshape((dim, )))
    skipgram.add(Dense(V, kernel_initializer='glorot_uniform', activation='softmax'))
    optim = Nadam(learning_rate=4e-3)
    skipgram.compile(loss='categorical_crossentropy', optimizer=optim, metrics=['accuracy'])
    # train skipgram model
    skipgram.fit(x_skipgram, y_skipgram, batch_size=batch_size, epochs=epochs, validation_split=0)
    model_name = f"skipgram_{dim}"
    skipgram_models[model_name] = skipgram
    # The embedding matrix is saved in the weights of the model
    weights = skipgram.get_weights()
    embedding = weights[0]
    temp = embedding.shape[1]
    columns = ["word"] + [f"features_{i + 1}" for i in range(temp)]
    with open(f"vectors_{model_name}.txt" ,'w') as f:
        f.write(",".join(columns))
        f.write("\n")
        for word, i in tokenizer.word_index.items():
            f.write(word)
            f.write(",")
            f.write(",".join(map(str, list(embedding[i,:]))))
            f.write("\n")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<b>HINT</b>: To increase training speed of your model, you can use the free available GPU power in Google Colab. Go to `Edit` --> `Notebook Settings` --> select `GPU` under `hardware accelerator`.

## Task 1.2 - CBOW

Build word embeddings of length 50, 150 and 300 using CBOW model.

In [18]:
# prepare data for CBOW
def generate_cbow(corpus, window_size, V):
    maxlen = window_size*2
    all_in = []
    all_out = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            p = index - window_size
            n = index + window_size + 1
            temp = []
            all_out.append(to_categorical(word, V))
            for i in range(p, n):
                if i != index:
                    if 0 <= i < L:
                        temp.append(words[i])
                    else:
                        temp.append(0)
            all_in.append(temp)
                                      
    return (np.array(all_in), np.array(all_out))

x_cbow , y_cbow = generate_cbow(corpus,window_size,V)
x_cbow.shape, y_cbow.shape

# create training data
# create CBOW architecture

cbow_models = {}

# save embeddings for vectors of length 50, 150 and 300 using CBOW model
dims = [50, 150, 300]
for dim in dims:
    skipgram = Sequential()
    skipgram.add(Embedding(input_dim=V, output_dim=dim, embeddings_initializer='glorot_uniform', input_length=2 * window_size,name='embedding'))
    skipgram.add(Reshape((dim, )))
    skipgram.add(Dense(V, kernel_initializer='glorot_uniform', activation='softmax'))
    optim = Nadam(learning_rate=4e-3)
    # optim = Adadelta(learning_rate=4e-3)
    cbow.compile(loss='categorical_crossentropy', optimizer=optim, metrics=['accuracy'])
    # train CBOW model
    cbow.fit(x_cbow, y_cbow, batch_size=batch_size, epochs=epochs, validation_split=0)
    model_name = f"cbow_{dim}"
    cbow_models[model_name] = cbow
    # The embedding matrix is saved in the weights of the model
    weights = cbow.get_weights()
    embedding = weights[0]
    temp = embedding.shape[1]
    columns = ["word"] + [f"features_{i + 1}" for i in range(temp)]
    with open(f"vectors_{model_name}.txt" ,'w') as f:
        f.write(",".join(columns))
        f.write("\n")
        for word, i in tokenizer.word_index.items():
            f.write(word)
            f.write(",")
            f.write(",".join(map(str, list(embedding[i,:]))))
            f.write("\n")


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Task 1.3 - Analogy function

Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in [1]. Do not use existing libraries for this task such as Gensim. Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. 

In a perfect scenario, we would like that this analogy ( $e_{king} - e_{queen} + e_{woman}$) results in the embedding of the word "man". However, it does not always result in exactly the same word embedding. The result of the formula is called the expected or the predicted word embedding. In this context, "man" is called the true or the actual word $t$. We want to find the word $p$ in the vocabulary, where the embedding of $p$ ($e_p$) is the closest to the predicted embedding (i.e. result of the formula). Then, we can check if $p$ is the same word as the true word $t$.  

You have to answer an analogy function using each embedding for both CBOW and Skipgram model. This means that for each analogy we have 6 outputs. Show the true word (with distance similarity value between predicted embedding and true word embedding, i.e. `sim1`) , the predicted word (with distance similarity value between predicted embedding and the embedding of the word in the vocabulary that is closest to this predicted embedding, i.e. `sim2`) and a boolean answer whether the predicted word **exactly** equals the true word. 

<b>HINT</b>: to visualize the results of the analogy tasks , you can print them in a table. An example is given below.


| Analogy task | True word (sim1)  | Predicted word (sim2) | Embedding | Correct?|
|------|------|------|------|------|
|  queen is to king as woman is to ?	 | man (sim1) | predictd_word(sim2) | SG_50 | True / False|

* Give at least 5 different  examples of analogies.
* Compare the performance on the analogy s between the word embeddings and briefly discuss your results.

In [41]:
extract_embedding = lambda model: model.get_weights()[0]
model_names = list(skipgram_models.keys()) + list(cbow_models.keys())
embeddings = [extract_embedding(v) for _, v in skipgram_models.items()] + [extract_embedding(v) for _, v in cbow_models.items()]

# Define Closest word function
def closest_word(model_name, embedded_word, metric="cosine"):
    df = pd.read_csv(f"vectors_{model_name}.txt", sep=",")
    words = list(df["word"])
    embedded_words = df.iloc[:, 1:].values.astype(np.float)
    embedded_word = embedded_word.reshape(1, -1)
    sims = cosine_similarity(embedded_word, embedded_words).reshape(-1)
    idx = np.argmax(sims)
    return words[idx], sims[idx]
    
#embedding function
def embed(word, embedding=embedding, vocab_size = V, tokenizer=tokenizer):
    # get the index of the word from the tokenizer, i.e. convert the string to it's corresponding integer in the vocabulary
    int_word = tokenizer.texts_to_sequences([word])[0]
    # get the one-hot encoding of the word
    bin_word = to_categorical(int_word, V)
    return np.dot(bin_word, embedding)

#test
analogies = [('queen', 'king', 'woman', 'man')]
# analogies = [('flower','Milk','Bread','egg')]
# analogies = [('Car','Driver','Road','Walker')]
# analogies = [('Dress','Girl','Skirt','Lady')]
# analogies = [('Studio','Single','Apartment','Family')]

for queen, king, woman, man in analogies:
    task = f"{queen} is to {king} as {woman} is to ?"
    df = []
    for model_name, embedding in zip(model_names, embeddings):
        emb1, emb2, emb3, true_emb = embed(queen, embedding), embed(king, embedding), embed(woman, embedding), embed(man, embedding)
        predicted_emb = emb2 - emb1 + emb3
        sim1 = cosine_similarity(true_emb.reshape(1, -1), predicted_emb.reshape(1, -1)).reshape(-1)[0]
        result1 = f"{man}({sim1})"
        predicted_word, sim2 = closest_word(model_name, predicted_emb)
        result2 = f"{predicted_word}({sim2})"
        result3 = str(predicted_word == man)
        vals = {"Analogy Task": task, "True Word(sim1)": result1, "Predicted Word(sim2)": result2, "Embedding": model_name, "Correct?": result3}
        df.append(vals)

df = pd.DataFrame(df)
df

Unnamed: 0,Analogy Task,True Word(sim1),Predicted Word(sim2),Embedding,Correct?
0,queen is to king as woman is to ?,man(0.40918177366256714),woman(0.5641658265001745),skipgram_50,False
1,queen is to king as woman is to ?,man(0.27073967456817627),woman(0.6491862453459567),skipgram_150,False
2,queen is to king as woman is to ?,man(0.25881877541542053),woman(0.6510149400331547),skipgram_300,False
3,queen is to king as woman is to ?,man(0.2200305014848709),woman(0.6212387719623047),cbow_50,False
4,queen is to king as woman is to ?,man(0.2200305014848709),woman(0.6413854630029787),cbow_150,False
5,queen is to king as woman is to ?,man(0.2200305014848709),woman(0.6465533679559283),cbow_300,False


##Discussion

In this model, all the results come to a False. Therefore, in this case, these two are not comparable. Yet, according to the understanding of the two approaches, they have the following characteristics. First of all, CBOW is comparatively faster to train than skip-gram. Secondly, CBOW is better for frequently occurring words while kip-gram works well for less frequently occurring words than CBOW. Third, Skip-gram is slower but works well for the smaller amount of data then CBOW. 



## Task 1.4 - Discussion
Answer the following question:
* Given the same number of sentences as input, CBOW and Skipgram arrange the data into different number of training samples. Which one has more and why?

skip-gram takes more number of training samples. First, in CBOW we were trying to predict the focus words given the context words whereas in skip-gram we are trying to do the opposite we try to predict the context words given the focus word.
In both the number of parameters to train but in CBOW we had one softmax to train and whereas in skip-gram we have k softmax, hence skip-gram takes more time than CBOW, so, it is computationally more expensive.

# Question 2 - Peer review (0 pt):
Finally, each group member must write a single paragraph outlining their opinion on the work distribution within the group. Did every group member
contribute equally? Did you split up tasks in a fair manner, or jointly worked through the exercises. Do you think that some members of your group deserve a different grade from others? You can use the table below to make an overview of how the tasks were divided:



| Student name | Task  |
|------|------|
| YU-WEN HUANG | I work alone. |
