Introduction:
-------------

This document explains A la Carte Embeddings (ALC). Before introducing them, we cover some background on Natural Language Processing (NLP) basics and Word Embeddings.

We start with the basics of text preprocessing and representation, then introduce the concept of Word Embeddings, and finally explain ALC.


Text Representation:
--------------------

Text representation is the process of transforming text into a structured and numerical representation that can be understood by NLP or Machine Learning algorithms.

The **goal** of text representation is to **transform** text to **extract knowledge** and **meaning** about the data.

This is usually done by converting text into a vector of numbers that can be understood by the algorithms.

Once the text is converted into vectors of numbers, we can understand the data better by applying techniques such as measuring the distance between the vectors.
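As a minimal sketch of what "measuring the distance between the vectors" can look like, the snippet below computes cosine similarity, one common choice among several. The three count vectors and their vocabulary are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical word-count vectors over the vocabulary ["apple", "banana", "car"]
doc_fruit_1 = [2, 1, 0]
doc_fruit_2 = [1, 1, 0]
doc_cars = [0, 0, 3]

sim_fruit = cosine_similarity(doc_fruit_1, doc_fruit_2)  # close to 1: similar texts
sim_mixed = cosine_similarity(doc_fruit_1, doc_cars)     # 0: no words in common
```

A value near 1 means the two texts use words in similar proportions, while 0 means they share no words at all.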

The **most common** text representation techniques are:

- **Bag of Words (BoW)**: This technique represents a text as an unordered collection (a "bag") of its words. Each word in the vocabulary is assigned a number: its frequency in the text. This technique is simple and easy to implement, but it ignores both the order of the words and their context.

- **TF-IDF**: This technique is similar to BoW, but it also weighs how important a word is to a text. The weight is the frequency of the word in the text (term frequency) multiplied by its inverse document frequency. The inverse document frequency is usually the logarithm of the total number of documents divided by the number of documents that contain the word, so words that appear in almost every document receive a low weight. This technique is also simple and easy to implement, but it shares the drawbacks of BoW: it ignores both the order of the words and their context.

- **Word Embeddings**: This technique represents words as dense vectors of numbers. The vectors are learned from the contexts in which the words appear, so words used in similar contexts end up with similar vectors. This technique is more complex and harder to implement, but unlike BoW and TF-IDF it captures the context of the words.
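The TF-IDF weighting from the list above can be sketched in a few lines of plain Python. This is only an illustration: the toy corpus is made up, and the sketch assumes the common log-scaled variant `tf * log(N / df)`; libraries typically apply additional smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "ran"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts, i.e. the Bag of Words values
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
```

Here `"the"` appears in every document, so its weight is 0 everywhere, while `"ran"` appears twice in a single document and receives the highest weight.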

**`In this document, we will focus on Word Embeddings.`**


NLP preprocessing techniques:
-----------------------------

Before applying any text representation technique, we need to preprocess the text. The preprocessing techniques are used to clean the text and to prepare it for the text representation techniques.

The **most common** preprocessing techniques are:

- **Tokenization**: This technique is used to split the text into tokens. The tokens can be words, sentences, or paragraphs. The most common tokenization technique is word tokenization, where the text is split into words.
    - `Example`: `"I like to eat apples and bananas."` -> `["I", "like", "to", "eat", "apples", "and", "bananas"]`

- **Normalization**: This technique is used to convert the text to lowercase, remove punctuation, and remove numbers.
    - `Example`: `"I like to eat apples and bananas."` -> `"i like to eat apples and bananas"`

- **Stop words removal**: This technique is used to remove stop words from the text. Stop words are words that are very common in the language and carry little meaning on their own.
    - `Example`: `"the"`, `"a"`, `"an"`, `"in"`, `"on"`, `"at"`.

- **Stemming**: This technique is used to reduce words to their root form. The most common stemming technique is Porter Stemming, where words are reduced to their root form by removing suffixes.
    - `Example`: `"running" -> "run"`, `"eats" -> "eat"`, `"eating" -> "eat"`, `"a" -> "a"`

- **Lemmatization**: This technique is used to reduce words to their dictionary form (lemma). The most common lemmatization technique is WordNet Lemmatization, where words are mapped to their lemma by looking them up in a dictionary.
    - `Example`: `"running" -> "run"`, `"eats" -> "eat"`, `"eating" -> "eat"`, `"a" -> "a"`

**Note**: The difference between stemming and lemmatization is that stemming uses a set of rules to reduce words to their root form, while lemmatization uses a dictionary to reduce words to their root form. The result of stemming is not always a valid word, while the result of lemmatization is always a valid word. This also means that stemming is faster than lemmatization, but it is less accurate.
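The difference described in the note can be illustrated with a toy sketch. Both the suffix rules and the mini-dictionary below are hypothetical and far smaller than what a real stemmer or lemmatizer uses; they only show rule-based stripping versus dictionary lookup.

```python
# Toy illustration only: a real project would use a library stemmer/lemmatizer.

SUFFIXES = ("ing", "es", "s")  # hypothetical, tiny rule set

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_DICT = {"studies": "study", "running": "run", "eats": "eat", "better": "good"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)

words = ["studies", "running", "eats", "better"]
stems = [toy_stem(w) for w in words]        # ['studi', 'runn', 'eat', 'better']
lemmas = [toy_lemmatize(w) for w in words]  # ['study', 'run', 'eat', 'good']
```

The rules are cheap to apply but produce non-words such as `studi`, whereas the lookup always returns a valid word at the cost of needing a dictionary.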

Other common preprocessing techniques, such as Part of Speech (POS) tagging and Named Entity Recognition (NER), are not covered in this document for the sake of simplicity.

Preprocessing also involves making pipelines where we can combine different preprocessing techniques to obtain better results.
Example:

```
Preprocessing pipeline: Tokenization -> Normalization -> Stop words removal -> Stemming
```



In [67]:
def tokenize(text):
    """Tokenize text by splitting on whitespace
    Args:
        text (str): The string to tokenize
    Returns:
        list: The tokenized list of strings
    """
    tokens = text.split()
    return tokens

def normalize(tokens):
    """Normalize tokens by lowercasing them and removing punctuation
    Args:
        tokens (list): List of tokenized words
    Returns:
        list: List of normalized words
    """
    punctuation = ['.', ',', '!', '?', ';', ':']
    normalized_tokens = []

    for token in tokens:
        # Convert to lowercase
        normalized_token = token.lower()

        # Remove punctuation
        normalized_token = ''.join(char for char in normalized_token if char not in punctuation)
        normalized_tokens.append(normalized_token)
    
    return normalized_tokens

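def stem(tokens):
    """Naively "stem" tokens by dropping their last character.

    This is a placeholder, not the Porter stemmer: it only shows where a
    stemming step fits in the pipeline.
    Args:
        tokens (list): List of tokenized words
    Returns:
        list: List of truncated words
    """
    stemmed_tokens = []
    for token in tokens:
        # Drop the last character of every token, whatever it is
        stemmed_tokens.append(token[:-1])
    return stemmed_tokens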

def remove_stopwords(tokens):
    """Remove stop words from list of tokenized words
    Args:
        tokens (list): List of tokenized words
    Returns:
        list: List of tokenized words with stop words removed
    """
    # list of stopwords
    stop_words = ['a', 'an', 'the']
    filtered_tokens = []
    for token in tokens:
        if token not in stop_words:
            filtered_tokens.append(token)
    return filtered_tokens

Applying the preprocessing pipeline to the text: `The quick brown fox jumps over the lazy dog.`

In [68]:
text = "The quick brown fox jumps over the lazy dog."

Tokenization

In [69]:
tokens = tokenize(text)
tokens

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']

Normalization

In [70]:
tokens = normalize(tokens)
tokens

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Stop words removal

In [71]:
tokens = remove_stopwords(tokens)
tokens

['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']

Stemming

In [72]:
tokens = stem(tokens)
tokens

['quic', 'brow', 'fo', 'jump', 'ove', 'laz', 'do']

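Note that the `stem` function above simply drops the last character of each token, which is why the output contains non-words such as `quic`; a real stemmer such as Porter applies suffix rules instead. The preprocessing techniques involved may also vary depending on the task at hand. While the functions in the example above were implemented from scratch, there are libraries that can help with this task, for example **`NLTK`**, **`spaCy`**, **`Gensim`**, and **`Scikit-learn`**.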

Now that we have covered the basics of text preprocessing, we can introduce the idea of transforming this data into a vector space. The result of this transformation is called a **`Document-Term Matrix`**. Because most entries of this matrix are zero, it is usually stored as a **`Sparse Matrix`**.
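To make the idea concrete, here is a minimal pure-Python sketch of a Document-Term Matrix and of a sparse view that keeps only its non-zero cells. The three documents are made up, and the dictionary-based sparse format is just one simple choice for illustration.

```python
# Hypothetical toy corpus: each string is one document.
documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps",
]

tokenized = [doc.split() for doc in documents]
vocabulary = sorted({token for doc in tokenized for token in doc})
column = {term: j for j, term in enumerate(vocabulary)}

# Dense document-term matrix: one row per document, one column per term.
matrix = [[0] * len(vocabulary) for _ in documents]
for i, doc in enumerate(tokenized):
    for token in doc:
        matrix[i][column[token]] += 1

# Sparse view: keep only the non-zero cells as {(row, column): count}.
sparse = {(i, j): count
          for i, row in enumerate(matrix)
          for j, count in enumerate(row) if count}

# Fraction of cells that are zero in the dense matrix.
zero_fraction = 1 - len(sparse) / (len(documents) * len(vocabulary))
```

Even in this tiny corpus almost half of the cells are zero; with a realistic vocabulary of tens of thousands of terms, nearly all of them are.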

We will use the Bag of Words approach and rely on a few libraries, since we already know how to preprocess the data.


In [73]:
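from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer expects an iterable of documents (strings). Passing the
# token list means each token is treated as its own one-word document.
vectorizer = CountVectorizer()
document_term_matrix = vectorizer.fit_transform(tokens)

# feature names
print(vectorizer.get_feature_names_out(), end='\n\n')

# vocabulary
print(f'Vocabulary{vectorizer.vocabulary_}', end='\n\n')

# print the vectorized sparse matrix (converted to a dense array for display)
print(f'Sparse Matrix:\n{document_term_matrix.toarray()}')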


['brow' 'do' 'fo' 'jump' 'laz' 'ove' 'quic']

Vocabulary{'quic': 6, 'brow': 0, 'fo': 2, 'jump': 3, 'ove': 5, 'laz': 4, 'do': 1}

Sparse Matrix:
[[0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0]
 [0 0 0 0 1 0 0]
 [0 1 0 0 0 0 0]]


As we can see, the result of the Bag of Words approach is a Document-Term Matrix.

The rows of the matrix represent the documents, and the columns represent the words. The values represent the frequency of each word in each document. In the example above, every token was passed as its own document, so each row contains a single 1.

The problem with this approach is that it considers neither the order of the words nor their context. Every word gets its own independent column, so the representation carries no notion of similarity: “good” and “great” have similar meanings, yet their vectors are as unrelated as those of “good” and “bad”, which have opposite meanings.
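This limitation can be checked with a short sketch. The three-word vocabulary is made up; each word is represented by the Bag of Words vector of a one-word document, and the dot product measures how much two vectors overlap.

```python
vocabulary = ["bad", "good", "great"]  # hypothetical vocabulary

def one_hot(word):
    """Bag of Words vector of a document containing only `word`."""
    return [1 if term == word else 0 for term in vocabulary]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

good, great, bad = one_hot("good"), one_hot("great"), one_hot("bad")

overlap_good_great = dot(good, great)  # 0, although the meanings are similar
overlap_good_bad = dot(good, bad)      # 0, although the meanings are opposite
```

Both overlaps are zero, so the representation cannot tell a synonym from an antonym. This is the gap that Word Embeddings aim to close.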

Word Embeddings:
----------------

Word Embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

Word Embeddings are calculated by considering the context of the words. This means that words that have similar context will have similar representations.

Word2Vec is one of the most popular techniques to obtain Word Embeddings. It is a two-layer neural network that processes text. Its input is a text corpus, and its output is a set of vectors: feature vectors for words in that corpus. While Word2Vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand.

Word2Vec has two variants: Skip-Gram and CBOW.

- Skip-Gram predicts the context given a word.
- Continuous Bag of Words (CBOW) predicts a word given the context.
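The difference between the two variants is easiest to see in the training examples each one builds from a sentence. The helper functions below are hypothetical and only generate those examples; they do not train a network.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples: the neighbors predict the center."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "quick", "brown", "fox"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
```

With `window=1`, Skip-Gram yields pairs such as `("quick", "the")` and `("quick", "brown")`, while CBOW yields `(["the", "brown"], "quick")` for the same position.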

In [76]:
from gensim.models import Word2Vec

print(f'Original text:\n{text}\n')

print(f'Tokens:\n{tokens}\n')

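# Define and train Word2Vec model
# (a single 7-token sentence is far too little data to learn useful vectors,
# so the values printed below are not meaningful; this only shows the API)
model_w2v = Word2Vec([tokens], vector_size=100, window=5, min_count=1, workers=4)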

# Get embeddings for each token
word_embeddings = {word: model_w2v.wv[word].tolist() for word in tokens}

# Print the word embeddings
for word, embedding in word_embeddings.items():
    print(f"{word}: {embedding}")

Original text:
The quick brown fox jumps over the lazy dog.

Tokens:
['quic', 'brow', 'fo', 'jump', 'ove', 'laz', 'do']

quic: [0.008132271468639374, -0.00445733405649662, -0.0010683572618290782, 0.0010063648223876953, -0.00019111395522486418, 0.001148177427239716, 0.006113860756158829, -2.0271540051908232e-05, -0.0032459653448313475, -0.0015107286162674427, 0.00589729892089963, 0.0015141022158786654, -0.0007242619758471847, 0.009333247318863869, -0.004921283572912216, -0.0008384096436202526, 0.00917541142553091, 0.0067494274117052555, 0.0015028560301288962, -0.008882560767233372, 0.0011487459996715188, -0.0022882556077092886, 0.009368237107992172, 0.0012099278392270207, 0.0014900636160746217, 0.002406409941613674, -0.0018360066460445523, -0.004999633878469467, 0.00023242950555868447, -0.002014180412515998, 0.006600933149456978, 0.00894012302160263, -0.0006747543811798096, 0.0029770147521048784, -0.006107654422521591, 0.00169932481367141, -0.006926232483237982, -0.008694026619195938, -

Global Vectors (GloVe):
-----------------------

GloVe is an unsupervised learning algorithm for obtaining word embeddings. Like Word2Vec, it produces dense word vectors, but it is trained differently: instead of predicting words from local context windows, it fits the vectors to global word-word co-occurrence statistics of the corpus.

GloVe is trained on the non-zero elements of the word-word co-occurrence matrix, which records how frequently words co-occur with each other.
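A word-word co-occurrence matrix can be sketched as follows. This is a simplified illustration with made-up sentences: it uses plain counts within a symmetric window and stores only the non-zero entries, whereas the GloVe paper additionally down-weights pairs that are farther apart.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return dict(counts)

# Hypothetical toy corpus, already tokenized.
sentences = [["ice", "is", "cold"], ["steam", "is", "hot"]]
counts = cooccurrence_counts(sentences, window=1)
```

Pairs that never co-occur, such as `("ice", "cold")` with `window=1`, are simply absent from the dictionary, which mirrors the fact that only non-zero entries are used for training.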

The result of GloVe is a matrix that contains the word embeddings for each word. The rows of the matrix represent the words, and the columns represent the dimensions of the embeddings.

The embeddings usually have between 50 and 300 dimensions. The number of dimensions is a hyperparameter that can be tuned.

Two options are worth highlighting:

- We can use the pre-trained word vectors published by the GloVe project, which cover the most common words in the English language.

- We can train our own word embeddings on a corpus of text using the training code released by the GloVe project.

Link to the project page: https://nlp.stanford.edu/projects/glove/

Link to the paper: https://nlp.stanford.edu/pubs/glove.pdf
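As a sketch of how pre-trained vectors could be loaded, the function below assumes the plain-text layout used by the downloadable files: one word per line followed by its space-separated values. The function name `load_vectors` and the 3-dimensional sample data are made up for illustration.

```python
import io

def load_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into {word: [floats]}."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Made-up 3-dimensional vectors standing in for a real pre-trained file.
sample_file = io.StringIO("king 0.5 0.1 -0.2\nqueen 0.4 0.2 -0.1\n")
embeddings = load_vectors(sample_file)
```

With a real download, the same function could be fed an open file handle, for example `open("glove.6B.50d.txt", encoding="utf-8")`, assuming that file has been extracted locally.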

References:
-----------

- [Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.](https://nlp.stanford.edu/IR-book/)
