<img src='data/images/lecture-notebook-header.png' />

# Data Preparation: Word Embeddings (Word2Vec)

Word embeddings are dense vector representations of words in a mathematical space, typically in high-dimensional space, where each dimension captures a different aspect of the word's meaning. These representations are learned from large amounts of text data using techniques like neural networks, specifically models like Word2Vec, GloVe, and fastText. The basic idea behind word embeddings is that words with similar meanings or usage patterns tend to occur in similar contexts. Therefore, by training a model to predict a word based on its surrounding words or predicting the surrounding words given a target word, the model can learn to encode semantic and syntactic relationships between words. Word embeddings have become a fundamental component in various NLP tasks, including sentiment analysis, machine translation, text classification, and information retrieval. They enable algorithms to better understand and work with textual data by representing words as continuous vectors with rich semantic information.

Word embeddings are trained using unsupervised learning techniques on large amounts of text data. There are two popular approaches for training word embeddings: the count-based approach and the predictive approach.

* **Count-based Approach:** In this approach, the co-occurrence statistics of words within a context window are calculated from the corpus. The context window is a fixed-size window of words surrounding the target word. The intuition behind this approach is that words that have similar contexts tend to have similar meanings.

    * One popular count-based algorithm is GloVe (Global Vectors for Word Representation). GloVe constructs a co-occurrence matrix that captures the statistics of word co-occurrences in a corpus. It then factorizes this matrix to obtain word embeddings that encode the relationships between words based on their co-occurrence patterns.

    * Another approach is LSA (Latent Semantic Analysis), which applies Singular Value Decomposition (SVD) on the co-occurrence matrix to obtain word embeddings.

* **Predictive Approach:** The predictive approach uses neural network models to predict a target word based on its context or vice versa. These models are trained to minimize the prediction error and learn meaningful representations.

    * The Word2Vec model, specifically the Skip-gram and Continuous Bag-of-Words (CBOW) architectures, are popular predictive models. Skip-gram aims to predict the surrounding words given a target word, while CBOW predicts the target word based on its context words.

    * More recent models like ELMo (Embeddings from Language Models) and BERT (Bidirectional Encoder Representations from Transformers) produce contextual embeddings using deep architectures (bidirectional LSTMs in ELMo, Transformers in BERT). These models compute word representations based on the entire context of a sentence, capturing more nuanced meanings.
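To make the count-based idea more concrete, here is a minimal sketch that builds a co-occurrence matrix and reduces it with a truncated SVD (in the spirit of LSA). The toy corpus, the window size, and the number of dimensions `k` are assumptions chosen only for illustration; this is not how GloVe itself is trained.

```python
import numpy as np

# Toy corpus (hypothetical); a real corpus would be far larger
corpus = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "great"],
    ["the", "movie", "was", "boring"],
]
window = 2  # number of context words considered on each side

# Map each unique word to a row/column index
words = sorted({w for sentence in corpus for w in sentence})
index = {w: i for i, w in enumerate(words)}

# Count how often two words appear within `window` positions of each other
cooc = np.zeros((len(words), len(words)), dtype=np.float64)
for sentence in corpus:
    for i, w in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[w], index[sentence[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors
k = 2
U, S, _ = np.linalg.svd(cooc)
embeddings = U[:, :k] * S[:k]

print(embeddings.shape)
```

Each row of `embeddings` is the dense vector of one word; words that share contexts (here "movie" and "film") have similar co-occurrence rows before the reduction.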

During training, the word embeddings are updated iteratively using optimizers such as stochastic gradient descent. The objective is to minimize a loss function that measures the discrepancy between predicted and actual words in the given context; techniques such as negative sampling make this objective cheaper to compute for large vocabularies.
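As a rough illustration of negative sampling, the sketch below computes the loss for a single (center, context) pair: the true context word should get a high score, while a handful of randomly drawn "negative" words should get low scores. The vectors are random stand-ins and the function name is hypothetical; an actual training loop would also compute gradients and update the vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one (center, context) pair with k sampled negative words."""
    # Reward a high score for the true context word
    positive_term = -np.log(sigmoid(context_vec @ center_vec))
    # Reward low scores for the k sampled negative words
    negative_term = -np.sum(np.log(sigmoid(-(negative_vecs @ center_vec))))
    return positive_term + negative_term

rng = np.random.default_rng(0)
dim, k = 8, 5                          # embedding size and number of negatives (assumed)
center = rng.normal(size=dim)          # stand-in for the center word vector
context = rng.normal(size=dim)         # stand-in for the true context word vector
negatives = rng.normal(size=(k, dim))  # stand-ins for k sampled negative words

loss = negative_sampling_loss(center, context, negatives)
print(loss)
```

The key point is that only `k + 1` dot products are needed per training pair, instead of one per vocabulary entry as in a full softmax.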

In this course, we take a detailed look at how **Word2Vec** embeddings are trained. Training word embeddings from scratch requires very large text corpora and is therefore very time- and resource-intensive. Since we want to focus on the underlying approach, we train Word2Vec embeddings on a small, domain-specific dataset. The purpose of this notebook is to prepare the text corpus -- the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) -- to serve as the training dataset for both implementations of Word2Vec: CBOW and Skip-gram. As a reminder, the figure below, taken from the lecture slides, shows the basic setup for both implementations:

<img src='data/images/lecture-slide-06.png' width='80%' />

## Setting up the Notebook

### Import Required Packages

In [None]:
import re
import os

import numpy as np
from tqdm import tqdm
from collections import Counter, OrderedDict

We utilize some utility methods from PyTorch as well as Torchtext, so we need to import the `torch` and `torchtext` packages.

In [None]:
import torch
import torchtext
from torchtext.vocab import vocab

As usual, we rely on spaCy to perform basic text preprocessing and cleaning steps, mainly tokenization and lemmatization.

In [None]:
import spacy

# Tell spaCy to use the GPU (if available)
spacy.prefer_gpu()

nlp = spacy.load("en_core_web_sm")

Lastly, `src/utils.py` provides some utility methods to download and decompress files. Since the datasets used in some of the notebooks are of considerable size -- although far from huge -- they are not part of the repository and need to be downloaded (and optionally decompressed) separately. The 2 methods `download_file` and `decompress_file` accomplish this for convenience.

In [None]:
from src.utils import download_file, decompress_file
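Since `src/utils.py` is not shown in this notebook, the following is only a hypothetical sketch of what two such helpers could look like using the Python standard library. The names `download_file_sketch` and `decompress_file_sketch` are made up for illustration, and the actual implementations in the repository may differ.

```python
import os
import tarfile
import urllib.request

def download_file_sketch(url, target_path):
    """Download `url` into the folder `target_path` and return the local file name."""
    os.makedirs(target_path, exist_ok=True)
    file_name = os.path.join(target_path, os.path.basename(url))
    urllib.request.urlretrieve(url, file_name)
    return file_name

def decompress_file_sketch(file_name, target_path):
    """Extract a .tar.gz archive into the folder `target_path`."""
    os.makedirs(target_path, exist_ok=True)
    with tarfile.open(file_name, "r:gz") as archive:
        # Only extract archives from sources you trust
        archive.extractall(target_path)
```

Both helpers create the target folder if it does not exist yet, so they can be called on a fresh checkout.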

**Important:** The code cells below that download the files naturally include the URLs of those files. However, there is always the chance that one of those files might be removed or renamed, in which case the URL will no longer be valid. In this case, it is recommended to search for alternative links using, e.g., Google or Bing, which should cause no problems as all datasets used here are generally widely available.

---

## Download Dataset

The [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), commonly known as the IMDb dataset or IMDb movie reviews dataset, is a widely used benchmark dataset in natural language processing (NLP) and sentiment analysis. Created by Andrew Maas and a group of researchers at Stanford University, this dataset consists of movie reviews collected from IMDb (Internet Movie Database).

Here are the key characteristics of the Large Movie Review Dataset:

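* **Data Size:** It contains a collection of 50,000 labeled movie reviews. The archive additionally ships unlabeled reviews (the `train/unsup` folder), which this notebook also uses since training word embeddings does not require labels.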

* **Review Split:** The dataset is evenly divided into two sets:
    * 25,000 reviews for training
    * 25,000 reviews for testing

* **Sentiment Labels:** Each review is labeled with sentiment polarity:
    * 50% of reviews are labeled as positive
    * 50% of reviews are labeled as negative

* **Binary Classification Task:** The dataset is commonly used for binary sentiment classification tasks, where the goal is to classify whether a review expresses positive or negative sentiment.

This dataset serves as a standard benchmark for sentiment analysis and text classification algorithms, enabling researchers and developers to evaluate and compare the performance of different machine learning and deep learning models in sentiment classification tasks. The availability of labeled data in large quantities allows for the training and evaluation of models to predict sentiment accurately, making it a valuable resource in the field of natural language processing and sentiment analysis research.

Given its size, the dataset is not included in the GitHub repository. You can either download the dataset yourself using the link above, or you can run the notebook "Representations (Word2Vec - Data Preparation)" first, which downloads the dataset for you.

If you have already downloaded and decompressed the dataset in the previous notebook, you can skip the code cell below. Otherwise run the code cell to fetch the dataset. We recommend using the given `target_path` as this won't require any additional changes in subsequent code cells.

In [None]:
# print('Download file...')
# download_file('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', target_path='data/datasets/imdb-reviews/')
# print('Decompress file...')
# decompress_file('data/datasets/imdb-reviews/aclImdb_v1.tar.gz', target_path='data/datasets/imdb-reviews/')
# print('DONE.')

The dataset comes organized in multiple folders, each folder containing many files where one file represents one review. To iterate over all reviews in later code cells, we first prepare the list of folders and count the total number of review files.

In [None]:
corpus_base_path = 'data/datasets/imdb-reviews/'

folders = [
    corpus_base_path+'aclImdb/test/pos',
    corpus_base_path+'aclImdb/test/neg',    
    corpus_base_path+'aclImdb/train/pos',
    corpus_base_path+'aclImdb/train/neg',
    corpus_base_path+'aclImdb/train/unsup'    
]

num_reviews = 0

for folder in folders:
    num_reviews += sum([len(files) for r, d, files in os.walk(folder)])

num_reviews = min(num_reviews, 999999999)    
    
print("Total number of reviews: {}".format(num_reviews))

---

## Generating CBOW and Skipgram Training Datasets

### Auxiliary Method for Data Cleaning & Preprocessing

The method `process_file()` below takes a single review file as input and returns all valid tokens as a list. In doing so, it removes all punctuation marks and stopwords. It also performs lemmatization, which is arguably a less obvious choice. The decision to lemmatize words when training word embeddings depends on the specific use case and the desired characteristics of the word embeddings. Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. For example, lemmatizing the words "running," "ran," and "runs" would result in the common lemma "run." Lemmatization helps reduce the sparsity of the data and can group together different inflected forms of a word, which can be beneficial in certain scenarios. Here are a few considerations regarding lemmatization when training word embeddings:

* **Reducing Dimensionality:** Lemmatization can reduce the dimensionality of the vocabulary and the resulting word embeddings. By mapping multiple inflected forms to a single lemma, you can potentially reduce the overall vocabulary size and improve the efficiency of the training process.

* **Generalization:** Lemmatizing words can help capture the general semantic meaning of a word, as it removes specific tense, case, or number information. This can be advantageous if you want your word embeddings to capture broader semantic relationships.

* **Preserving Word Variations:** On the other hand, if your application requires sensitivity to word variations, such as capturing different verb tenses or noun plural forms, you may choose not to lemmatize the words. By preserving the different forms, the resulting word embeddings may better capture the specific nuances or syntactic patterns associated with those variations.

* **Task-Specific Considerations:** The choice of lemmatization may depend on the specific downstream task you plan to use the word embeddings for. Some tasks, like part-of-speech tagging or named entity recognition, may benefit from lemmatization to reduce word form variability. However, other tasks, like sentiment analysis or text classification, may require word embeddings that retain specific word forms to capture sentiment or emphasis associated with those forms.

In summary, whether to lemmatize words during training depends on your specific requirements and the characteristics you want the resulting word embeddings to capture. Both lemmatized and non-lemmatized word embeddings have their own advantages and limitations, so it's essential to consider the specific needs of your application when making this decision. For the purpose of this and subsequent notebooks, we want to keep things simple and the vocabulary small, and therefore perform lemmatization.
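The effect of lemmatization on the vocabulary size can be illustrated with a tiny example. The lemma table below is hand-written and purely hypothetical; the notebook itself relies on spaCy's lemmatizer instead.

```python
from collections import Counter

# Hand-written toy lemma table (hypothetical); the notebook uses spaCy's lemmatizer
LEMMAS = {"running": "run", "ran": "run", "runs": "run", "movies": "movie", "films": "film"}

toy_tokens = ["she", "runs", "he", "ran", "they", "like", "running", "movies", "movie", "films"]

# Vocabulary without lemmatization: every surface form is its own entry
surface_counts = Counter(toy_tokens)

# Vocabulary with lemmatization: inflected forms collapse into one entry
lemma_counts = Counter(LEMMAS.get(t, t) for t in toy_tokens)

print(len(surface_counts), len(lemma_counts))
print(lemma_counts["run"])
```

The lemmatized vocabulary is smaller, and each remaining entry is backed by more occurrences, which is the sparsity reduction mentioned above.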

Since the movie reviews can include HTML tags, we remove those as well using a regular expression. Again, everything here is kept to a bare minimum to keep things short and simple. Feel free to put more thought into potentially better preprocessing steps. We won't really use the trained word embeddings for any downstream task, so there is no harm in trying different alternatives.

In [None]:
def process_file(file_name):
    text = None
    with open(file_name, 'r', encoding='utf-8') as file:
        text = file.read().replace('\n', '')
        
    if text is None:
        return

    ## Remove HTML tags
    p = re.compile(r'<.*?>')
    text = p.sub(' ', text)
    
    ## Let spaCy do its magic
    doc = nlp(text)
    
    ## Return "proper tokens" (lemma, lowercase, stopword removal)
    return [ t.lemma_.lower() for t in doc if t.pos_ != 'PUNCT' and t.dep_ != 'punct' and t.lemma_.strip() != '' and not t.is_stop ]


process_file('data/datasets/imdb-reviews/aclImdb/train/pos/0_9.txt')
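The HTML removal step relies on the non-greedy pattern `<.*?>`. The short sketch below, using a made-up review snippet, shows why the `?` matters: a greedy pattern would swallow everything between the first `<` and the last `>`. The extra whitespace normalization at the end is an optional addition and not part of `process_file()`.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')  # non-greedy: stops at the first closing '>'

raw = "Great movie!<br /><br />I <i>loved</i> it."  # made-up example text

# Replace each tag with a space, as in process_file()
without_tags = TAG_PATTERN.sub(' ', raw)

# Optional extra step: collapse runs of whitespace into single spaces
clean = re.sub(r'\s+', ' ', without_tags).strip()

# For comparison: a greedy pattern removes the text between the tags as well
greedy = re.sub(r'<.*>', ' ', raw)

print(clean)
print(greedy)
```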

### Process Review Files

The code cell below iterates over all text files representing the movie reviews in the specified folders (see above). For each review, we first extract all the tokens using the method `process_file()`. This returns the list of relevant tokens for this review, which we append to a list of all tokens across all reviews.

For each token, we also keep track of its count. We only need this to later create the final vocabulary by only looking at the top-k (e.g., top-20k most frequent) words. For testing, we recommend using a lower value for `considered_num_reviews` (e.g., 1000) to see if this and the other notebooks are working (of course, the results won't be great). Once you think all is good, you can set `considered_num_reviews` back to a value of at least `num_reviews` to work on the whole dataset.

In [None]:
considered_num_reviews = min(num_reviews, 99999999)

print(considered_num_reviews)

In [None]:
def process_reviews(folders):
    tokens = []                # List of all tokens across all reviews
    token_counter = Counter()  # Maps each token to its frequency
    review_count = 0           # Running counter of processed reviews
    # Iterate over all reviews
    with tqdm(total=considered_num_reviews) as progress_bar:
        for folder in folders:
            for entry in os.scandir(folder):
                # Ignore directories (just a fail-safe; not really needed)
                if not entry.is_file():
                    continue
                # Preprocess review
                review_tokens = process_file(entry.path)
                # Add all extracted tokens to the final list
                tokens.extend(review_tokens)
                # Update token counts
                token_counter.update(review_tokens)
                # Update progress bar
                progress_bar.update(1)
                # Check if we need to stop early
                review_count += 1
                if review_count >= considered_num_reviews:
                    return tokens, token_counter
    # Return tokens and token counts
    return tokens, token_counter
                
tokens, token_counter = process_reviews(folders)  
    
            
print('Total number of tokens: {}'.format(len(tokens)))
print('Number of unique tokens: {}'.format(len(token_counter)))

It's important to note that `tokens` now contains all relevant tokens from all the considered reviews. In other words, we have concatenated all reviews into one long list of tokens. This also means that we completely ignore any sentence boundaries, which in turn means that the context of a word may belong to 2 sentences (or even more if the sentences are very short and the context window is large). There are different arguments why or why not this is a proper approach, but it's certainly not uncommon, and it keeps the code in this notebook simple. In practice, more thought goes into these design decisions for the data preparation. Here we can keep it simple by actually utilizing the fact that our corpus is very domain specific (movie reviews).
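For comparison, here is a minimal sketch of the alternative design in which a window never crosses a boundary. It assumes the tokens are kept as one list per review (or per sentence) instead of a single concatenated list; the function name and the toy data are hypothetical.

```python
def windows_within(sequences, window_size):
    """Yield (context, center) pairs whose window stays inside a single sequence."""
    for seq in sequences:
        # Sequences shorter than a full window yield nothing
        for pos in range(window_size, len(seq) - window_size):
            context = seq[pos - window_size:pos] + seq[pos + 1:pos + window_size + 1]
            yield context, seq[pos]

# Two toy "reviews" (made up)
toy_reviews = [["great", "movie", "overall"], ["boring", "plot", "bad", "acting"]]

pairs = list(windows_within(toy_reviews, window_size=1))
for context, center in pairs:
    print(center, context)
```

With boundaries respected, the seven toy tokens yield three windows; concatenating them first, as this notebook does, would yield five. Short sequences contribute few or no samples, which is one argument for the simpler concatenation.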

### Create & Save Vocabulary

To use the dataset for training a PyTorch model, we need to map each unique word/token to a unique index (i.e., integer identifier). Given a vocabulary size of `V`, these unique indices must lie in the range from `0` to `V-1`. This is needed since, in the end, training a model on the data comes down to matrix/tensor operations, and we use the identifiers to index the respective tensors. Also, we often want to do additional steps such as considering only the top-k most frequent tokens. Again, it's not difficult to implement this from scratch; however, the `torchtext` package simplifies this, resulting in cleaner code.

At least once we use all reviews, the number of unique tokens will be quite large. However, we already know that the vocabulary will contain many tokens that occurred maybe only once or twice. These tokens are not really useful for training word embeddings to begin with. We therefore only consider the most frequent tokens for the vocabulary. The method `process_reviews` already returns the number of occurrences for each token. So if we want to limit the total number of tokens, we simply need to pick the most frequent tokens using those counts. The code cell below accomplishes this, considering the top 20k tokens by default.

In [None]:
TOP_TOKENS = 20000

# Sort with respect to frequencies
token_counter_sorted = sorted(token_counter.items(), key=lambda x: x[1], reverse=True)

token_ordered_dict = OrderedDict(token_counter_sorted[:TOP_TOKENS])
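As a side note, `Counter` already offers `most_common()`, which returns the entries sorted by decreasing frequency. The sketch below, using a small made-up counter, shows that it yields the same result as the explicit sorting above.

```python
from collections import Counter, OrderedDict

# Small made-up counter standing in for token_counter
toy_counter = Counter(["good", "bad", "good", "movie", "good", "movie"])
top_k = 2

# Explicit sorting, as in the code cell above
via_sorted = OrderedDict(sorted(toy_counter.items(), key=lambda x: x[1], reverse=True)[:top_k])

# Equivalent shortcut using Counter.most_common()
via_most_common = OrderedDict(toy_counter.most_common(top_k))

print(via_sorted)
print(via_most_common)
```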

We can now create a `vocab` object. At its core, it creates the mappings between the tokens and their indices. It also supports some additional useful features:

* For many tasks, we need to include special tokens in our vocabulary. For example, we often need a special token (e.g., `<PAD>`) to represent an "empty" word we can use to pad sequences (see also the other notebooks). Even more common is a special token (e.g., `<UNK>`) to represent tokens that haven't been seen when building the vocabulary. Note that the exact strings for those tokens do not matter. For example, we could have used, say, `[[[padding]]]` and `[[[unseen]]]`. It's only important that those tokens are unique. In the code cell below, we also add `<SOS>` (start of sequence) and `<EOS>` (end of sequence). These are typically required for tasks such as machine translation. While not needed here, there is no harm in having them either.

* By using `set_default_index()` we can specify the default index to be used if a sentence we want to transform contains a word not seen before. Most intuitively, we will use the index representing the special token `<UNK>`.

Strictly speaking, for training the word embeddings, only the `<UNK>` token is required. However, adding the other special tokens does not negatively affect the training, and they could come in handy if we want to use the vocabulary and dataset for other training tasks where these tokens are required.

In [None]:
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"
SOS_TOKEN = "<SOS>"
EOS_TOKEN = "<EOS>"

SPECIALS = [PAD_TOKEN, UNK_TOKEN, SOS_TOKEN, EOS_TOKEN]

vocabulary = vocab(token_ordered_dict, specials=SPECIALS)

vocabulary.set_default_index(vocabulary[UNK_TOKEN])

print("Number of tokens: {}".format(len(vocabulary)))
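To make explicit what the vocabulary does for us, here is a minimal pure-Python sketch of the same idea: special tokens come first, the remaining tokens follow in frequency order, and unknown tokens fall back to a default index. The class `SimpleVocab` is hypothetical and only mimics the behavior this notebook relies on; it is not the `torchtext` implementation.

```python
class SimpleVocab:
    """Minimal token <-> index mapping with a default index for unseen tokens."""

    def __init__(self, ordered_tokens, specials, unk_token):
        # Special tokens get the lowest indices, followed by the regular tokens
        self.itos = list(specials) + [t for t in ordered_tokens if t not in specials]
        self.stoi = {token: idx for idx, token in enumerate(self.itos)}
        self.default_index = self.stoi[unk_token]

    def __len__(self):
        return len(self.itos)

    def lookup_indices(self, tokens):
        # Unseen tokens are mapped to the default index
        return [self.stoi.get(token, self.default_index) for token in tokens]


toy_vocab = SimpleVocab(
    ordered_tokens=["movie", "good", "bad"],  # made-up tokens, most frequent first
    specials=["<PAD>", "<UNK>", "<SOS>", "<EOS>"],
    unk_token="<UNK>",
)

print(len(toy_vocab))
print(toy_vocab.lookup_indices(["good", "unseen-word", "<PAD>"]))
```

The indices cover exactly the range `0` to `V-1`, which is what is needed to index an embedding matrix with `V` rows.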

In [None]:
vocabulary_file_name = corpus_base_path+"imdb-word2vec-{}.vocab".format(TOP_TOKENS)

torch.save(vocabulary, vocabulary_file_name)

### Generate Dataset from Extracted Tokens

Recall that `tokens` contains all the tokens from all the reviews in a single list without any consideration of sentence boundaries. As such, we can simply move a sliding window over the whole list to capture the current context and the current center word. We create both datasets as NumPy arrays containing the indices of the context words and the center word. Recall from the lecture that a context and center word results in (a) one sample for the CBOW dataset and (b) `2*window_size` samples for the Skip-gram dataset. The figure below shows the relevant part from the lecture slides.

<img src='data/images/lecture-slide-07.png' width='50%' />

The code cell below uses a loop to move a sliding window over all tokens to generate the CBOW and Skip-gram samples as illustrated above.

In [None]:
window_size = 5

# Given the window size, we can directly infer the required sizes for the 2 Numpy arrays
cbow = np.zeros(( len(tokens)-(2*window_size) , (2*window_size)+1 ), dtype=np.int32)
skipgram = np.zeros(((len(tokens)-(2*window_size))*(2*window_size), 2), dtype=np.int32)

# Loop through list of tokens
with tqdm(total=cbow.shape[0]) as pbar:
    for center_idx, pos in enumerate(range(window_size, len(tokens)-window_size)):

        # Get current center word and current context words
        center = tokens[pos]

        context = tokens[pos-window_size:pos] + tokens[pos+1:pos+window_size+1]

        # A CBOW sample is an array containing 2*window_size context word indices + the center word index
        cbow_sample = np.array( vocabulary.lookup_indices(context) + vocabulary.lookup_indices([center]) )

        cbow[pos-window_size] = cbow_sample

        # Loop over all context words to generate the 2*window_size (center_word, context_word)-pairs
        for idx, c in enumerate(context):
            skipgram_sample = np.array(vocabulary.lookup_indices([center]) + vocabulary.lookup_indices([c]) )
            skipgram[(center_idx*window_size*2)+idx] = skipgram_sample
            
        # Update progress bar
        pbar.update(1)
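The loop above is easy to follow but slow for millions of tokens. As a possible alternative, the sketch below builds both arrays with vectorized NumPy operations. It assumes that all tokens have already been converted to indices once (e.g., via a single call to `vocabulary.lookup_indices(tokens)`) and that NumPy 1.20 or later is available for `sliding_window_view`; the function name `build_datasets` is made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def build_datasets(token_ids, window_size):
    """Build the CBOW and Skip-gram arrays from a 1-D sequence of token indices."""
    token_ids = np.asarray(token_ids, dtype=np.int32)
    # Each row holds one full window: left context, center, right context
    windows = sliding_window_view(token_ids, 2 * window_size + 1)
    center = windows[:, window_size]
    context = np.delete(windows, window_size, axis=1)
    # CBOW: context indices followed by the center index (same layout as the loop)
    cbow = np.concatenate([context, center[:, None]], axis=1)
    # Skip-gram: one (center, context) pair per context word
    skipgram = np.stack([np.repeat(center, 2 * window_size), context.reshape(-1)], axis=1)
    return cbow, skipgram

# Toy check with made-up indices 0..6 and a window size of 2
toy_cbow, toy_skipgram = build_datasets(range(7), window_size=2)
print(toy_cbow)
print(toy_skipgram[:4])
```

For the toy input, the first CBOW row contains the context indices `0, 1, 3, 4` followed by the center index `2`, and the first four Skip-gram rows pair the center `2` with each of those context indices, matching the ordering produced by the loop.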

Again, we save our datasets to be later used in the training notebooks.

In [None]:
np.save(corpus_base_path+'imdb-dataset-cbow.npy', cbow)
np.save(corpus_base_path+'imdb-dataset-skipgram.npy', skipgram)
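When the saved arrays are loaded again, the columns need to be split into model inputs and targets. The sketch below illustrates the column layout with tiny stand-in arrays written to a temporary folder; the file names and values are made up, and how the training notebooks actually consume the data may differ.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in arrays with the same column layout as the saved datasets
cbow_demo = np.array([[0, 1, 3, 4, 2], [1, 2, 4, 5, 3]], dtype=np.int32)
skipgram_demo = np.array([[2, 0], [2, 1], [2, 3], [2, 4]], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp_dir:
    cbow_path = os.path.join(tmp_dir, "cbow-demo.npy")
    skipgram_path = os.path.join(tmp_dir, "skipgram-demo.npy")
    np.save(cbow_path, cbow_demo)
    np.save(skipgram_path, skipgram_demo)
    cbow_loaded = np.load(cbow_path)
    skipgram_loaded = np.load(skipgram_path)

# CBOW: all but the last column are the context words (input), the last is the center (target)
cbow_inputs, cbow_targets = cbow_loaded[:, :-1], cbow_loaded[:, -1]

# Skip-gram: first column is the center word (input), second is one context word (target)
skipgram_inputs, skipgram_targets = skipgram_loaded[:, 0], skipgram_loaded[:, 1]

print(cbow_inputs.shape, cbow_targets.shape)
print(skipgram_inputs.shape, skipgram_targets.shape)
```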

---

## Discussion

As mentioned in the beginning, using the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) to train word embeddings has some clear limitations. As our goal is not to train word embeddings to be used for downstream tasks, but to understand and replicate the basic strategies, this is not an issue here. However, it's worthwhile to highlight some of the limitations and to address them when it comes to training "proper" word embeddings.

* Apart from its small size, the dataset used here is very domain-specific, containing only movie reviews. This means that many words -- more common in other domains -- might not appear at all, or that words with multiple meanings will only be used in a single context. So while we could use word embeddings trained over this dataset for our sentiment classifier over the same data (see the notebook covering the RNN-based sentiment classification model), they arguably are not suitable for tasks in different domains.

* The dataset used in this notebook was small enough that we could easily load it into main memory. However, datasets to build proper language models are huge and would not fit into memory all at once. In this case, some logic is required to first split the whole dataset into multiple chunks (e.g., different files) and then iterate over all chunks within each epoch.

* In practice, large language models are typically trained in a distributed setting such as computing clusters housing many GPUs. Deep learning frameworks such as PyTorch and TensorFlow support distributed training and inference out of the box, but again, some additional logic is required to facilitate this.
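Regarding the memory limitation, one simple option besides splitting the data into multiple files is memory mapping: NumPy can open a saved `.npy` file without reading it completely and only loads the rows that are actually accessed. The sketch below is a minimal illustration under that assumption; the function name, file name, and batch size are made up, and a real training setup would typically add shuffling.

```python
import os
import tempfile

import numpy as np

def iterate_batches(path, batch_size):
    """Yield consecutive batches from a .npy file without loading it entirely."""
    data = np.load(path, mmap_mode="r")  # memory-mapped, read-only view of the file
    for start in range(0, data.shape[0], batch_size):
        # Copy only the current slice into main memory
        yield np.array(data[start:start + batch_size])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.npy")
    np.save(demo_path, np.arange(20, dtype=np.int32).reshape(10, 2))
    batch_shapes = [batch.shape for batch in iterate_batches(demo_path, batch_size=4)]

print(batch_shapes)
```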

---

## Summary

Data preparation for training word embeddings is a crucial and challenging step in NLP tasks. Word embeddings, which represent words as dense vector representations in a high-dimensional space, have become an essential tool for various NLP applications such as machine translation, sentiment analysis, and named entity recognition.

The importance of data preparation lies in the fact that word embeddings heavily rely on the context in which words appear. Therefore, the quality and diversity of the training data greatly impact the resulting word embeddings. To ensure accurate and meaningful representations, the data needs to be large, diverse, and representative of the target domain. This often requires extensive preprocessing, including text cleaning, normalization, tokenization, and removal of stopwords, punctuation, and special characters.

Additionally, the challenge in data preparation arises from the inherent complexity of natural language. Language is highly nuanced, with varying sentence structures, grammar rules, and word meanings. Ambiguities, homonyms, and polysemy further complicate the task. Furthermore, handling out-of-vocabulary (OOV) words that do not appear in the training data requires special attention. Techniques such as subword tokenization or incorporating external resources like pre-trained embeddings can help address this challenge.

Moreover, data preparation for training word embeddings needs to consider the size and quality trade-off. Large datasets can improve the coverage and generalization of embeddings, but they also require significant computational resources and time for processing. Balancing the need for a sufficiently large corpus with limited resources is a constant consideration.

In conclusion, data preparation for training word embeddings is critical for generating meaningful and accurate representations of words. It involves cleaning and preprocessing the data to ensure its quality, diversity, and representativeness. The challenges lie in the complex nature of language, including nuances, ambiguities, and OOV words. Striking a balance between dataset size and quality is also a key consideration. Effective data preparation is essential for achieving high-quality word embeddings and subsequently improving the performance of NLP tasks.