# Tutorial 5 - Word Embeddings

The lectures so far have covered NLP preprocessing of textual data and word embeddings. In today's notebook, we therefore revisit Word2Vec (W2V), an algorithm for learning word embeddings that comes in two variants, i.e., CBOW and Skip-Gram. While the two flavours of W2V differ in how they model token embeddings, both aim to produce dense numerical vector representations that capture the semantic relationships in the input text. Once we have covered how to create pretrained embedding dictionaries compatible with keras, we will generate a two-dimensional representation of the trained word embeddings that is suitable for plotting. Afterward, we will use the W2V embeddings in an MLP-based network trained for sentiment classification. <br>

Here is the outline of today's notebook:
*   Word2Vec: Implementation of an Embedding Layer Dictionary with CBOW and Skip-Gram (Demo). 
*   Plotting Embeddings using t-SNE (Exercise 1).
*   MLP-based Neural Network with W2V Embeddings for Sentiment Classification (Exercise 2).


In [None]:
import sys
print(sys.executable)

In [None]:
# required packages:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from bs4 import BeautifulSoup
import re
from sklearn.model_selection import train_test_split
import pickle
from sklearn.metrics import recall_score,precision_score,roc_auc_score
import numpy as np
from keras.layers import TextVectorization
from gensim.models import Word2Vec 
import time
import keras
from keras import Sequential
import seaborn as sns
import matplotlib.pyplot as plt

## **1. Word2Vec: Implementation of an Embedding Layer Dictionary with CBOW and Skip-Gram** (Demo)<br>

We use W2V to train word embeddings, which capture the contextual meaning of individual tokens:<br>
<img src="https://github.com/Humboldt-WI/demopy/raw/main/W2V.PNG" width="1140" height="610" alt="W2V">

In W2V, each word is represented by two vectors of dimension *𝑑*, as every word takes both roles, i.e., context word and target word. The word vectors are the parameters of a neural network, which we train on our corpus of textual data for language modeling purposes. The goal is to learn low-dimensional, dense representations of words as continuous numerical vectors, which enable ML models to capture the meaning and the semantics of words algorithmically. Language modeling can therefore be regarded as the upstream task, whereas, e.g., sentiment classification using the pretrained embeddings from W2V would be the downstream task. There are two variants of W2V, which we can use to obtain word embeddings: <br>

<img src="https://github.com/Humboldt-WI/demopy/raw/main/CBOW_and_SkipGram.PNG" width="1450" height="390" alt="CBOW_and_SkipGram">
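
To make the difference between the two variants concrete, the following sketch builds the training tuples for a single toy sentence. CBOW maps the words inside a context window (here 2 tokens to the left and right) to the center word, while Skip-Gram maps the center word to each context word in turn. The helper names `context_window`, `cbow_pairs` and `skipgram_pairs` as well as the toy sentence are assumptions made for illustration only; actual W2V implementations add further refinements on top of this basic idea.

```python
def context_window(tokens, center, window):
    """Return the tokens at most `window` positions away from index `center`."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right


def cbow_pairs(tokens, window=2):
    """CBOW: the context tokens are the input, the center token is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens, window=2):
    """Skip-Gram: the center token is the input, each context token is one target."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


# Hypothetical toy sentence (already tokenized):
sentence = ["the", "movie", "was", "really", "good"]
cbow = cbow_pairs(sentence)
skipgram = skipgram_pairs(sentence)

print(cbow[2])       # (['the', 'movie', 'really', 'good'], 'was')
print(skipgram[:3])  # [('the', 'movie'), ('the', 'was'), ('movie', 'the')]
```

Note that no labels are needed to build these tuples: inputs and targets both come from the text itself.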

The Continuous Bag of Words (CBOW) model differs from the Skip-Gram model mainly in that the former predicts the center word from the surrounding context tokens, whereas the latter predicts the context from the center word. The number of surrounding words considered is defined by a context window. In the above visualization, the context window is set to 2 tokens. The training process involves parsing the textual samples with context size 2 and sliding the window along each sentence, which yields training tuples consisting of inputs and targets. Language modeling does not require target labels or text annotation, as the training inputs and outputs are obtained from parsing the textual samples. We therefore call such a training process self-supervised. Both flavours of W2V are trained with a shallow neural network. For simplicity, assume you have a single token as the input and the following word as the target:<br>
<img src="https://github.com/Humboldt-WI/demopy/raw/main/W2V_Shallow_Network_Example.PNG" width="1450" height="410" alt="W2V_Shallow_Network_Example">
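
The lookup that happens in the hidden layer of this network can be checked with a few lines of NumPy. The sketch below assumes a toy vocabulary of four words and a randomly initialized weight matrix with embedding dimension 3 (both are made up for illustration). It shows that multiplying the one-hot vector of the word *can* with the weight matrix returns exactly the row of the matrix that belongs to *can*.

```python
import numpy as np

# Toy setup (assumption): 4 words in the vocabulary, embedding dimension d = 3
toy_vocab = ["i", "can", "swim", "today"]
d = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(len(toy_vocab), d))  # hidden-layer weights, one row per word

# One-hot vector with a 1 at the index of the input word "can"
can_idx = toy_vocab.index("can")
one_hot = np.zeros(len(toy_vocab))
one_hot[can_idx] = 1.0

# The matrix product selects the row of W associated with "can"
hidden = one_hot @ W
print(np.allclose(hidden, W[can_idx]))  # True
```

This is why an embedding layer can be implemented as a plain index lookup instead of an actual matrix multiplication.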

The input layer of the network represents a one-hot vector, which has a 1 at the index of the input word and zeros everywhere else. The product of the one-hot vector with the continuous trainable weights of the shallow network amounts to selecting the vector of weights in the hidden layer associated with the input word *can*. This is also the vector of weights that we will use as our word embedding in the downstream task. If we choose to train our own embeddings using W2V instead of downloading pretrained embeddings, we first have to clean our dataset. Thus, let's import the IMDB dataset, and clean it using our `NLP_preprocessing_pipeline` function from the previous tutorial notebook. While previously the function returned a list of cleaned tokens, in this notebook we will design our function to return the whole cleaned string of each textual sample, as we will later feed these cleaned strings to a `TextVectorization` layer from `keras`:<br>

In [None]:
# Lemmatize with POS Tag (Parts of Speech tagging)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word:str)->str:
    """Map POS tag to first character for lemmatization

    Returns:
    --------
    pos: str
        The WordNet part-of-speech constant for the word (defaults to noun).
    """

    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
                
    pos=tag_dict.get(tag,wordnet.NOUN)

    return pos

def NLP_preprocessing_pipeline(textual_sample:str)->str:
    '''
    Implements 7 steps of an NLP preprocessing pipeline.

    Parameters:
    -----------
    textual_sample:str
        The input text that requires preprocessing

    Returns:
    --------
    preprocessed_textual_sample:str
        The textual sample after each of the 7 preprocessing steps have been applied.

    '''
    lemmatizer = WordNetLemmatizer()

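    #Removing of URLs (raw string, so that \S is passed to the regex engine unchanged):
    preprocessed_textual_sample = re.sub(r"http\S+", "",textual_sample)

    #Removing of HTML tags (the parser is named explicitly to avoid a BeautifulSoup warning):
    preprocessed_textual_sample = BeautifulSoup(preprocessed_textual_sample, "html.parser").get_text()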

    #Removing of non-alphabetic characters:
    preprocessed_textual_sample = re.sub("[^a-zA-Z]", " ",preprocessed_textual_sample)

    #Changing all tokens to lower case:
    preprocessed_textual_sample = preprocessed_textual_sample.lower()

    #Tokenization:
    preprocessed_textual_sample=nltk.word_tokenize(preprocessed_textual_sample)

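    #Stopwords removal (build the set once instead of reloading the stopword list for every token):
    english_stopwords = set(stopwords.words("english"))
    preprocessed_textual_sample = [w for w in preprocessed_textual_sample if w not in english_stopwords]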

    #Lemmatization:
    preprocessed_textual_sample=[lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in preprocessed_textual_sample]
    preprocessed_textual_sample=' '.join(preprocessed_textual_sample)
    
    return preprocessed_textual_sample

In [None]:
df = pd.read_csv("IMDB-50K-Movie-Review.zip", sep=",", encoding="ISO-8859-1").iloc[:5000,:]
df.head(10)

In [None]:
X=df['review'].apply(NLP_preprocessing_pipeline)

In [None]:
#We will save the cleaned version of our movie reviews as the preprocessing takes a while to complete.
with open('cleaned_X_str.pkl','wb') as f:
    pickle.dump(X,f)

In [None]:
with open('cleaned_X_str.pkl','rb') as f:
    X=pickle.load(f)

In [None]:
X[:10]

Next, we will build our vocabulary with the `TextVectorization` layer. The latter transforms the input strings to a list of integer token indices, each of which is associated with a unique word in our vocabulary. We learn the vocabulary by calling the function `adapt`. When the layer is adapted, it learns the frequency of the individual tokens in the dataset. If we specify a maximum size of our vocabulary, e.g., 15k, then the layer creates a vocabulary containing the most frequently encountered words in the cleaned textual samples, up to that size. Words outside of this vocabulary get mapped to the UNK-token, i.e., the out-of-vocabulary token. For consistency purposes, we will also specify the maximal sequence length for each textual sample. For instance, if we set the sequence length to 100 tokens, then textual samples with fewer tokens will get padded with zeros to a length of 100, and longer samples will get truncated to 100 tokens. Similarly to other preprocessing techniques that we have applied so far in this semester, the layer `TextVectorization` is adapted on the train set only. For this reason, we will split our cleaned data *X* into train and test set before the learning of the vocabulary takes place. 
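
Before we use the actual layer, the following pure-Python sketch imitates the idea on three made-up reviews. It is a simplified stand-in and not the `keras` implementation: the helper names `build_toy_vocab` and `toy_vectorize` are hypothetical, and we assume that the padding token `''` (index 0) and the `'[UNK]'` token (index 1) count toward the maximum vocabulary size.

```python
from collections import Counter


def build_toy_vocab(texts, max_tokens):
    """Keep the most frequent tokens; indices 0 and 1 are reserved."""
    counts = Counter(tok for text in texts for tok in text.split())
    # Assumption: '' and '[UNK]' occupy two of the max_tokens slots
    most_frequent = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_frequent


def toy_vectorize(text, vocab, seq_length):
    """Map tokens to indices, then truncate or zero-pad to seq_length."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.split()][:seq_length]
    return ids + [0] * (seq_length - len(ids))


toy_train = ["good movie good plot", "bad movie", "good acting"]
toy_vocab = build_toy_vocab(toy_train, max_tokens=4)
print(toy_vocab)  # ['', '[UNK]', 'good', 'movie']

# 'boring' is not in the vocabulary, so it is mapped to index 1 ([UNK]):
print(toy_vectorize("good boring movie", toy_vocab, seq_length=5))  # [2, 1, 3, 0, 0]
```

The vocabulary is built from `toy_train` only, which mirrors why we adapt the real layer on the train set and not on the test set.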

In [None]:
#Split into train and test subsets:
y=df['sentiment'].map({'positive':1,'negative':0}).values  # map text-based class labels to numbers
Xclean_train, Xclean_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 5)  # data partitioning

# Create a vectorization layer
vocab_size = 15000
seq_length = 100
vectorize_layer = TextVectorization(
    standardize = None,
    #since we have cleaned our data already, 
    # we pass None to the text standardization parameter.
    max_tokens = vocab_size,
    output_sequence_length = seq_length)

#Learn the vocabulary on the train set:
vectorize_layer.adapt(Xclean_train)

#print the first 15 tokens in the vocabulary:
vocab = vectorize_layer.get_vocabulary()
print(vocab[:15])

The output from the vectorization layer is a list of integers indicating the positions of the words in the vocabulary, which is ordered by token frequency. Let's take the first clean sample in our test set, and transform it using the `TextVectorization` layer that we have already adapted on our training data:

In [None]:
print('The first clean textual sample in the test dataset:')
print(Xclean_test.iloc[0],'\n')
print('The corresponding sentiment label: ',y_test[0])

In [None]:
print('The output from the Text Vectorization: ')
vectorize_layer(Xclean_test.iloc[0])

The zeros correspond to an empty token resulting from the padding process, and the index 1 corresponds to the UNK token. Since we are interested in training our own embeddings with W2V, we will first vectorize our train textual samples. We will then use the output to retrieve the corresponding words in string format from our vocabulary, which would serve as the input to W2V. In this way, we would generate embeddings only for the 15k most frequent tokens in our vocabulary. If we feed the cleaned textual samples to W2V without the text vectorization step, then W2V might produce embeddings also for tokens that get mapped to the UNK token during the training process. 

In [None]:
X_train_vec=vectorize_layer(Xclean_train)
X_train_words = [[vocab[w] for w in rev if vocab[w] not in ['','[UNK]']] for rev in X_train_vec]
#we collect all train words except '' and '[UNK]', as we do not have to learn embeddings for these two tokens.
#When we create our keras embedding layer, we will overwrite all embeddings with those learned with W2V except the embeddings
# for the first two tokens. The embeddings of the latter will be randomly initialized, as the tokens '' and '[UNK]' do not have 
# a specific contextual meaning.  

In [None]:
#Since this can take a while, we will save the result so that we can import it later:
with open('X_train_words.pkl','wb') as f:
    pickle.dump(X_train_words,f)

In [None]:
with open('X_train_words.pkl','rb') as f:
    X_train_words=pickle.load(f)

Next, we will learn the word embeddings using both CBOW and Skip-Gram. Afterward, we will create embedding layers, the continuous vectors of which will be overwritten with the embeddings from W2V:

In [None]:
# Train CBOW and SkipGram:
#Keep track of the training time:
embeddings_dimension=300
start_time_cbow=time.time()
w2v_model_cbow = Word2Vec(X_train_words, min_count=100, #we set the minimal token frequency to 100 to avoid creating embeddings for 
                        #rarely represented words
                        window=10, #the size of context
                        epochs=50,  
                        vector_size=embeddings_dimension, #size of embedding
                        workers=4,#for parallel computing
                        sg  = 0)  
end_time_cbow=time.time()

start_time_skipgram=time.time()
w2v_model_skipgram = Word2Vec(X_train_words, min_count=100, window=10,     
                 epochs=50,  vector_size=embeddings_dimension, workers=4, sg  = 1)  

end_time_skipgram=time.time()
print('Training Time with CBOW in Seconds: ',round(end_time_cbow-start_time_cbow,5))
print('Training Time with SkipGram in Seconds: ',round(end_time_skipgram-start_time_skipgram,5))  


#Create corresponding Embedding Layers:
embedding_layer_cbow=keras.layers.Embedding(vocab_size,embeddings_dimension)
#The layer receives batches of token indices, i.e., inputs of shape (batch size, sequence length):
embedding_layer_cbow.build(input_shape=(None,seq_length))

embedding_layer_skipgram= keras.layers.Embedding(vocab_size,embeddings_dimension)
embedding_layer_skipgram.build(input_shape=(None,seq_length))



In [None]:
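# As an alternative to training your own embeddings, you can download pretrained embeddings using the package gensim.
# Note: the downloader returns the word vectors (KeyedVectors) directly, so you would index the result itself instead of its `.wv` attribute:
# import gensim.downloader
# w2v_model_cbow=gensim.downloader.load('word2vec-google-news-300')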

In [None]:
#Overwrite the weights in the embedding layers with the continuous token representations learned with CBOW and Skip-Gram:
 
embeddings_cbow=[]
embeddings_skipgram=[]

#Fill in the embedding matrices:
for token_idx in range(0,len(vocab)):
    if token_idx>1:
        if vocab[token_idx] in w2v_model_cbow.wv:
            embeddings_cbow.append(w2v_model_cbow.wv[vocab[token_idx]])
        else:#take embedding corresponding to UNK token:
            embeddings_cbow.append(embedding_layer_cbow.get_weights()[0][1])

        
        if vocab[token_idx] in w2v_model_skipgram.wv:
            embeddings_skipgram.append(w2v_model_skipgram.wv[vocab[token_idx]])
        else:#take embedding corresponding to UNK token:
            embeddings_skipgram.append(embedding_layer_skipgram.get_weights()[0][1])

    else:#then take embeddings of ''-token or '[UNK]'-token:
        embeddings_cbow.append(embedding_layer_cbow.get_weights()[0][token_idx])
        embeddings_skipgram.append(embedding_layer_skipgram.get_weights()[0][token_idx])

#Overwrite the weights in the embedding layers with the corresponding W2V token embeddings.
#You will need the embedding_layer_cbow and embedding_layer_skipgram for Exercise 1.
embedding_layer_cbow.set_weights(np.array([embeddings_cbow]))
embedding_layer_skipgram.set_weights(np.array([embeddings_skipgram]))

**Demo Summary**:<br>
- we can use W2V to train our own embeddings with two models, i.e., CBOW and Skip-Gram.<br>
- the inputs to W2V are the cleaned tokenized textual samples coming from the vocabulary built with the text vectorization layer. The latter creates the vocabulary from the set of most frequently encountered words in our cleaned dataset.
- once we have trained our embeddings, we save each continuous vector corresponding to a token in our vocabulary in embedding matrices that we use to overwrite the weights of the embedding layers in `keras`. These embedding layers serve as a dictionary to look up words based on their indices produced from the text vectorization transformation.
- we can use these embedding layers when implementing neural networks in `keras` to process text algorithmically. Also, we can use the output of the embedding layers as inputs to dimensionality reduction techniques that produce low dimensional vectors for plotting purposes.
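
The last two points of the summary can be reproduced on a toy scale with NumPy alone. In the sketch below, the dictionary `toy_trained` stands in for the trained W2V vectors, and `toy_vocab`, the embedding dimension of 4 and the random initialization range are all assumptions made for illustration. As in the demo, the rows for `''` and `'[UNK]'` keep their random initialization, and tokens without a trained vector fall back to the `'[UNK]'` row.

```python
import numpy as np

rng = np.random.default_rng(42)
toy_vocab = ["", "[UNK]", "good", "movie", "plot"]
dim = 4

# Stand-in for the trained W2V vectors: 'plot' was too rare to get a vector
toy_trained = {"good": np.ones(dim), "movie": np.full(dim, 2.0)}

# Random initialization, playing the role of a freshly built embedding layer
init = rng.uniform(-0.05, 0.05, size=(len(toy_vocab), dim))

matrix = init.copy()
for idx, token in enumerate(toy_vocab):
    if idx > 1 and token in toy_trained:
        matrix[idx] = toy_trained[token]  # overwrite with the trained vector
    elif idx > 1:
        matrix[idx] = init[1]             # fall back to the [UNK] row

# The embedding lookup is plain row indexing with the vectorized token ids:
token_ids = np.array([[2, 3, 4, 0]])      # one review: good, movie, plot, padding
embedded = matrix[token_ids]
print(embedded.shape)  # (1, 4, 4) = (batch size, sequence length, embedding dimension)
```

Each review thus turns into a matrix with one embedding vector per token position.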

## **2. Plotting Embeddings using t-SNE** (Exercise 1)<br>
T-distributed Stochastic Neighbor Embedding (t-SNE) is a technique for projecting high dimensional embeddings into a low dimensional space, e.g., a 2D space, for visualization purposes. The transformation aims to preserve the local neighborhood structure of the data in the low dimensional space, which helps us interpret the original high dimensional vectors. `Sklearn` provides you with easy access to an implementation of t-SNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). However, `sklearn`'s implementation is quite costly and can consume a lot of memory, especially if the original data comes in a high dimensional space such as the embeddings from W2V. Therefore, in this task you will use a less computationally expensive version of t-SNE implemented in the package `openTSNE`. For this purpose, you have to install the package with pip first. Once you have completed the installation of `openTSNE`, you are tasked with the following:<br>
- take the first 500 samples from the cleaned `Xclean_train` dataset, and transform them with the `vectorize_layer` from the demo.
- feed the vectorized output to the embedding layers containing the CBOW and Skip-Gram weights from the demo (`embedding_layer_cbow` and `embedding_layer_skipgram`), convert the Tensor to numpy format, and flatten the token sequences to obtain vectors of the dimensionality 300 (embedding dimension) * 100 (sequence length).
- fit `TSNE` from `openTSNE` on the cbow and skip-gram embeddings, and store the resulting two components together with the target labels from the train set (`y_train`) in two dataframes. Name the columns of the dataframes X-Axis, Y-Axis and Sentiment Class.
- use the package `seaborn` to plot two well-annotated scatter plots next to each other for each embedding type. Use the parameter hue from `seaborn`'s scatter plot to plot the relationship of the two-dimensional t-SNE embeddings to the sentiment class. You can check out some examples for plotting with seaborn under the following link: https://seaborn.pydata.org/generated/seaborn.scatterplot.html.  
- provide a short interpretation of the visualization: what pattern do you observe? what could be causing this pattern? in which scenario would you observe a different pattern?   

In [None]:
#!pip install openTSNE
from openTSNE import TSNE
num_samples=500
#Get the embeddings using the vectorize layer and the embeddings layers from the demo:
cbow_reviews_train=...
#flatten the last two dimensions:
cbow_reviews_train=...
cbow_tsne_embeddings = TSNE().fit(...)

cbow_tsne_embeddings=pd.DataFrame({...})

skipgram_reviews_train=...
#flatten the last two dimensions:
skipgram_reviews_train=...
skipgram_tsne_embeddings = TSNE().fit(...)
skipgram_tsne_embeddings=pd.DataFrame({...})

In [None]:
fig,ax=plt.subplots(ncols=2,nrows=1,figsize=(20,5))
sns.scatterplot(x=...,y=...,data=cbow_tsne_embeddings,hue=...,ax=ax[0])
ax[0].set_xlabel(...)
ax[0].set_ylabel(...)
ax[0].legend(mode='expand',ncols=2,loc=[0.0,-0.3],title='Sentiment Class')
ax[0].set_title(...)

sns.scatterplot(x=...,y=...,data=skipgram_tsne_embeddings,hue=...,ax=ax[1])
ax[1].set_xlabel(...)
ax[1].set_ylabel(...)
ax[1].set_title(...)

plt.show()
plt.close()


**Interpretation**:<br>
... 

## **3. MLP-based Neural Network with W2V Embeddings for Sentiment Classification** (Exercise 2)<br>
In this exercise, you are tasked with the implementation of two MLP-based neural networks for sentiment classification on the IMDB dataset using the W2V embeddings that we generated with CBOW and Skip-Gram in the demo part of this notebook:<br>
- build two sequential keras models (cbow_mlp_model and skipgram_mlp_model), each consisting of six layers: <br>
an input layer with shape = (1,) and dtype='string', the `vectorize_layer` from the demo, an embedding layer initialized with the vocabulary size = 15k, the embedding dimension = 300 and the input sequence length = 100, a GlobalAveragePooling1D layer, a GELU-based hidden layer with 10 units and a final Sigmoid-based prediction layer.<br>

- after compiling the models with the Adam optimizer and the binary crossentropy loss, overwrite the weights in the embedding layers with the embedding matrices from the demo `embeddings_cbow` and `embeddings_skipgram` by using the function `set_weights()`.
- make two runs with each model for 10 epochs with a batch size of 32: in the first run freeze the W2V weights, in the second run set the weights in the embedding layers to trainable. You should use the cleaned version of the train and test datasets from the demo when fitting the models, i.e., `Xclean_train` and `Xclean_test`.

- report the results in terms of AUC score, precision and recall score in a tabular overview. You can use the threshold of 0.5, so that you convert the keras probabilities into binary classes for the computation of the precision and recall scores.  
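
If you are unsure what the GlobalAveragePooling1D layer in this architecture does, the following NumPy sketch illustrates the idea on random numbers (the batch size, sequence length and embedding dimension are made up). Assuming that no masking is applied, the layer averages the token embeddings of each review over the sequence axis, so that every review is summarized by a single vector of the embedding dimension, regardless of how many tokens it contains.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, emb_dim = 2, 5, 3

# Stand-in for the output of an embedding layer: one vector per token position
embedded = rng.normal(size=(batch_size, seq_len, emb_dim))

# Average over the sequence axis (axis 1): one vector per review
pooled = embedded.mean(axis=1)
print(pooled.shape)  # (2, 3) = (batch size, embedding dimension)
```

Keep in mind that without masking, the embeddings of padded positions enter the average as well.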

In [None]:
predictions_results_frame=[]

for nr_run in [0,1]:
  #Create the models: based on the run (0 or 1) set the trainable parameter of the Embedding layer to True or False
  cbow_mlp_model = Sequential([...])

  skipgram_mlp_model = Sequential([...])

  #Compile the models and overwrite the embedding layer weights with the embeddings from W2V:
  cbow_mlp_model.compile(...)
  cbow_mlp_model.layers[1].set_weights(np.array([embeddings_cbow]))
  
  skipgram_mlp_model.compile(...)
  skipgram_mlp_model.layers[1].set_weights(np.array([embeddings_skipgram]))

  #Fit the cbow-based model and make predictions:
  cbow_mlp_model.fit(Xclean_train.to_numpy().reshape(-1,1), 
                     y_train,
                     ...)

  cbow_proba_pr=cbow_mlp_model.predict(Xclean_test.to_numpy().reshape(-1,1),verbose=0).flatten()
  cbow_class_pr=np.where(...)
  predictions_results_frame.append([...])

  #Do the same for the skipgram-based MLP:
  skipgram_mlp_model.fit(Xclean_train.to_numpy().reshape(-1,1), 
                         y_train,
                         ...)

  skipgram_proba_pr=skipgram_mlp_model.predict(Xclean_test.to_numpy().reshape(-1,1),verbose=0).flatten()
  skipgram_class_pr=np.where(...)
  predictions_results_frame.append([...])

#Put the results in a dataframe:
results_overview=pd.DataFrame(np.around(np.array(predictions_results_frame),3),columns=['Recall Score','Precision Score','ROC Auc Score'],
            index=[...])#name the indices in such way that it is clear which run was using trainable vs. non-trainable embedding layers

In [None]:
results_overview.sort_index()