# Word Embeddings
***
## Table of Contents
1. [Introduction](#1-introduction)
2. [Frequency-Based Word Embeddings](#2-frequency-based-word-embeddings)
    - [One-Hot Encoding](#one-hot-encoding)
    - [Bag of Words (BoW)](#bag-of-words-bow)
    - [Term Frequency-Inverse Document Frequency (TF-IDF)](#term-frequency-inverse-document-frequency-tf-idf)
    - [Global Vectors for Word Representation (GloVe)](#global-vectors-for-word-representation-glove)
        - [Co-Occurrence Probability Ratio](#co-occurrence-probability-ratio)
        - [Loss Function](#loss-function)
        - [Weighting Function](#weighting-function)
3. [Prediction-Based Word Embeddings](#3-prediction-based-word-embeddings)
    - [Word2Vec](#word2vec)
        - [Continuous Bag of Words (CBOW)](#continuous-bag-of-words-cbow)
        - [Skip-Gram](#skip-gram)
    - [fastText](#fasttext)
4. [Contextual-Based Word Embeddings](#4-contextual-based-word-embeddings)
    - [Bidirectional Encoder Representations from Transformers (BERT)](#bidirectional-encoder-representations-from-transformers-bert)
    - [Generative Pre-trained Transformer (GPT)](#generative-pre-trained-transformer-gpt)
    - [Embeddings from Language Models (ELMo)](#embeddings-from-language-models-elmo)
***

In [45]:
import numpy as np

## 1. Introduction
Vectorisation in Natural Language Processing (NLP) is the process of converting text into numerical representations (vectors) so that machine learning models can process and analyse natural language data.

Word embeddings are a specific type of vectorisation technique that represents words as dense vectors in a continuous, high-dimensional space. Unlike basic vectorisation methods, word embeddings are designed to reflect semantic and syntactic relationships between words. Hence, words with similar meanings or usage contexts are mapped to vectors that are close together in the embedding space.

## 2. Frequency-Based Word Embeddings
Frequency-based (or count-based) word embeddings represent words using statistics about their occurrences and co-occurrences in a corpus. These methods do not use neural networks or prediction tasks. Instead, they rely on explicit counts and matrix manipulations to derive word vectors. Though they are typically simple, interpretable and easy to parallelise, the simpler variants (one-hot encoding, BoW and TF-IDF) capture semantic similarity poorly.

### One-Hot Encoding
Each word in the vocabulary is represented by a sparse binary vector with a single $1$ at the position corresponding to the word and $0$ elsewhere. It does not capture semantic similarity between words.

In [46]:
from sklearn.preprocessing import OneHotEncoder

corpus = ['apple', 'banana', 'cherry', 'blueberry', 'apple']  # Sample

corpus_reshaped = np.array(corpus).reshape(-1, 1)  # Reshape

ohe = OneHotEncoder(sparse_output=False)  # Initialisation

one_hot_encoded_corpus = ohe.fit_transform(corpus_reshaped)  # One Hot Encode

print(one_hot_encoded_corpus)  # Output

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]
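
To make the "no semantic similarity" point concrete, here is a minimal sketch (the four-word vocabulary is an assumption chosen to match the sample above). Every pair of distinct one-hot vectors has a dot product of zero, so 'blueberry' is no closer to 'cherry' than to any other word.

```python
import numpy as np

vocab = ['apple', 'banana', 'blueberry', 'cherry']  # toy vocabulary (assumed)
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

# Dot product between every pair of word vectors
gram = one_hot @ one_hot.T

# Off-diagonal entries are all 0: distinct words are mutually orthogonal
print(gram)
```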


### Bag of Words (BoW)
Bag of Words is a simple technique that represents a document by the frequency of each word in the vocabulary, ignoring grammar and word order. Each document is a vector of word counts. We can implement BoW using the `CountVectorizer` class from the scikit-learn library.

In [47]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]  # Sample

vectoriser = CountVectorizer()  # Initialisation
X = vectoriser.fit_transform(corpus)  # Apply CountVectorizer()

print(vectoriser.get_feature_names_out())  # Output (features)
print(X.toarray())  # Output (BoW matrix)

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


### Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency (TF-IDF) extends BoW by weighting each word by its **frequency** in a document (TF) and its **rarity** across all documents (IDF). This helps highlight important words while reducing the influence of common ones such as 'the' or 'and'. The scikit-learn library provides the `TfidfVectorizer` class to implement TF-IDF.

\begin{align*}
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
\end{align*}

where:

- $t$: Specific term (word) being evaluated
- $d$: Single document within the corpus
- $D$: Set of all documents


- **Term Frequency (TF)**: Measures how often term $t$ appears in document $d$, normalised by the document's length.

\begin{align*}
\text{TF}(t, d) = \dfrac{\text{Number of times } t \text{ appears in } d}{\text{Total terms in } d}
\end{align*}

- **Inverse Document Frequency (IDF)**: Penalises terms common across many documents.

\begin{align*}
\text{IDF}(t, D) = \text{log} \left(\dfrac{\text{Total documents in corpus} (N)}{\text{Documents containing } t} \right)
\end{align*}

or, **smoothed IDF** (used in scikit-learn to avoid division by zero):

\begin{align*}
\text{IDF}(t, D) = \text{log} \left(\dfrac{N + 1}{\text{Documents containing } t + 1} \right) + 1
\end{align*}


*Example*: Suppose the word 'vector' appears 8 times in a 200-word document, and 50 of the 10000 documents in the corpus contain it. Using a base-10 logarithm:

\begin{align*}
\text{TF}(t, d) = \dfrac{8}{200} = 0.04
\end{align*}

\begin{align*}
\text{IDF}(t, D) = \log_{10} \left(\dfrac{10000}{50}\right) = \log_{10}(200) \approx 2.30
\end{align*}

\begin{align*}
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) = 0.04 \times 2.30 = 0.092
\end{align*}
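
The arithmetic in this example corresponds to a base-10 logarithm. As a quick check, the sketch below recomputes it with the standard library and also evaluates the natural-log and smoothed variants for comparison. The variable names are illustrative only.

```python
import math

count_in_doc = 8      # occurrences of 'vector' in the document
doc_length = 200      # total terms in the document
n_docs = 10_000       # documents in the corpus
docs_with_term = 50   # documents containing 'vector'

tf = count_in_doc / doc_length

idf_log10 = math.log10(n_docs / docs_with_term)                 # base-10, as in the example
idf_ln = math.log(n_docs / docs_with_term)                      # natural log
idf_smooth = math.log((n_docs + 1) / (docs_with_term + 1)) + 1  # smoothed variant

print(f"TF             = {tf:.2f}")
print(f"TF-IDF (log10) = {tf * idf_log10:.3f}")
print(f"TF-IDF (ln)    = {tf * idf_ln:.3f}")
print(f"TF-IDF (smooth)= {tf * idf_smooth:.3f}")
```

The choice of logarithm base only rescales every IDF value by the same constant, so it does not change the ranking of terms.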

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]  # Sample

vectoriser = TfidfVectorizer()  # Initialisation
X = vectoriser.fit_transform(corpus)  # Apply TfidfVectorizer()

print(vectoriser.get_feature_names_out())  # Output (features)
print(X.toarray())  # Output (TF-IDF matrix)

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
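
These values differ from the plain TF × IDF formula because, assuming scikit-learn's default settings, `TfidfVectorizer` uses raw counts as TF, the smoothed IDF with a natural logarithm, and then scales each document vector to unit L2 norm. The following sketch reproduces the first row by hand under those assumptions. The counts and document frequencies are read off the BoW matrix shown earlier.

```python
import numpy as np

features = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# Raw term counts for "This is the first document."
counts = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1], dtype=float)

# Number of documents (out of 4) that contain each term
doc_freq = np.array([1, 3, 2, 4, 1, 1, 4, 1, 4], dtype=float)
n_docs = 4

idf = np.log((n_docs + 1) / (doc_freq + 1)) + 1  # smoothed IDF
tfidf = counts * idf                             # unnormalised TF-IDF
row = tfidf / np.linalg.norm(tfidf)              # L2 normalisation

for name, value in zip(features, row):
    print(f"{name:>8}: {value:.8f}")
```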


### Global Vectors for Word Representation (GloVe)
Global Vectors for Word Representation (GloVe) is an unsupervised learning algorithm developed at Stanford to obtain vector representations of words. GloVe is a count-based method that leverages global word co-occurrence statistics from a corpus. The key idea is that ratios of word-word co-occurrence probabilities encode meaning, and by factorising a co-occurrence matrix, GloVe learns word vectors such that word similarities and analogies are preserved.

#### Co-Occurrence Probability Ratio
GloVe uses the ratio $P_{ik}/P_{jk}$ of co-occurrence probabilities to capture semantic relationships, where: \
$P_{ik} = \dfrac{X_{ik}}{X_{i}}$ (probability of word $k$ occurring in the context of word $i$), with $X_{i} = \sum_{k} X_{ik}$.
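
A minimal sketch of how these probabilities and their ratio could be computed from raw text. The three toy sentences, the window size of 2 and the helper names `prob` and `ratio` are all invented for illustration, and the counts are unweighted, which is a simplification.

```python
from collections import Counter, defaultdict

corpus = [
    "ice is solid and cold",
    "steam is gas and hot",
    "ice and steam are water",
]  # toy sentences, invented for illustration
window = 2  # symmetric context window (assumed)

cooc = defaultdict(Counter)  # cooc[i][k] holds the count X_ik
for sentence in corpus:
    tokens = sentence.split()
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                cooc[word][tokens[ctx_pos]] += 1

def prob(i, k):
    """P_ik = X_ik / X_i, where X_i sums the counts of all context words of i."""
    total = sum(cooc[i].values())
    return cooc[i][k] / total if total else 0.0

def ratio(i, j, k):
    """P_ik / P_jk, returning infinity when word k never co-occurs with j."""
    p_ik, p_jk = prob(i, k), prob(j, k)
    return p_ik / p_jk if p_jk else float('inf')

for probe in ['solid', 'gas', 'is']:
    print(probe, ratio('ice', 'steam', probe))
```

On this toy corpus the ratio is infinite for 'solid' (related to 'ice' only), zero for 'gas' (related to 'steam' only), and close to one for 'is', which co-occurs with both.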

#### Loss Function
The objective of GloVe is to minimise:
\begin{align*}
J = \sum_{i,j=1}^{V} f(X_{ij}) (w^{T}_{i} \tilde w_{j} + b_{i} + \tilde b_{j} - \text{log} X_{ij})^2
\end{align*}

where:

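- $V$: Vocabulary size.
- $w_{i}$, $\tilde w_{j}$: Word and context vectors.
- $b_{i}$, $\tilde b_{j}$: Word and context bias terms.
- $X_{ij}$: Co-occurrence count of words $i$ and $j$.
- $f(X_{ij})$: Weighting function that caps the influence of very frequent co-occurrences and down-weights rare ones.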


#### Weighting Function

\begin{align*}
f(x) = \begin{cases}
      \left(\dfrac{x}{x_{max}}\right)^\alpha   &   \text{if } x < x_{max}     \\
      1 &   \text{otherwise}     \\
      \end{cases}
\end{align*}

where $\alpha$ is a hyperparameter, typically set to $0.75$, and $x_{max}$ is a cut-off, typically set to $100$.
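
The weighting function and the objective can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the reference implementation: the function names, the random toy matrix and the vector dimension are hypothetical, and the sum is restricted to non-zero counts because $f(0) = 0$ while $\text{log}(0)$ is undefined.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows as (x / x_max)^alpha below x_max and is capped at 1 above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective J over the non-zero entries of X."""
    i_idx, j_idx = np.nonzero(X)          # zero counts contribute nothing
    x = X[i_idx, j_idx]
    pred = np.sum(W[i_idx] * W_tilde[j_idx], axis=1) + b[i_idx] + b_tilde[j_idx]
    return float(np.sum(glove_weight(x) * (pred - np.log(x)) ** 2))

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3                     # toy sizes (assumed)
X = rng.integers(0, 20, size=(vocab_size, vocab_size)).astype(float)
W = rng.normal(scale=0.1, size=(vocab_size, dim))
W_tilde = rng.normal(scale=0.1, size=(vocab_size, dim))
b = np.zeros(vocab_size)
b_tilde = np.zeros(vocab_size)

print("f(1), f(50), f(100), f(500):", glove_weight([1, 50, 100, 500]))
print("Loss with random vectors:", glove_loss(W, W_tilde, b, b_tilde, X))
```

Training would then minimise this loss with respect to the vectors and biases, for example with a gradient-based optimiser.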


The developers of GloVe provide pre-trained word vectors trained on large corpora on their [official web page](https://nlp.stanford.edu/projects/glove/). 
A part of the following code was retrieved from [Spot Intelligence](https://spotintelligence.com/2023/11/27/glove-embedding/).

In [49]:
# Load GloVe embeddings into a dictionary
def load_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

In [50]:
glove_embeddings_path = '_datasets/glove.6B.100d.txt'
glove_embeddings = load_embeddings(glove_embeddings_path)
len(glove_embeddings)

400001

In [51]:
# Accessing word embeddings
word = 'wikipedia'
if word in glove_embeddings:
    embedding = glove_embeddings[word]
    print(f"Embedding for '{word}': {embedding}")
else:
    print(f"'{word}' not found in embeddings")

Embedding for 'wikipedia': [-0.2536    0.21885   0.53349  -0.52444   0.53655   0.62448  -0.12989
 -0.83826   0.89195   0.033484  0.42016   0.44988   0.094579 -0.92764
 -0.48991   0.75895   0.48858  -0.57347  -0.75298   0.53346  -0.72722
  0.41164   0.049068  0.59324   0.028872 -1.4469    0.072449 -0.051847
  0.36257   0.1662    0.022671  1.263    -0.634    -0.72939   0.29486
  0.41603  -0.40253  -0.21218  -0.71229  -0.04464  -0.80034   0.83279
 -0.24826   0.61856  -0.26476   0.38703  -0.026548 -0.85908   0.34218
  0.28381   0.79504   0.78182  -0.81676  -0.023553 -1.4282   -0.065081
 -0.36143  -0.38418   0.49508  -0.079691 -0.21495   0.3556   -0.55288
 -0.14088   1.3684    0.29986  -0.051735 -0.27049   0.65376  -0.31637
  0.28904   1.4105    0.90976  -0.22609  -0.31961   0.036672  0.99641
  0.50815  -0.35471  -0.56741  -0.58292  -0.41092   0.28246  -0.31194
 -0.50438  -0.1069    0.080875 -0.75075  -0.087019  0.22302   0.011673
  0.70839   0.014801 -0.29071   1.0279   -0.27078  -0.17947 

For GloVe (and most word embedding algorithms), similar words are placed close together in the vector space. A standard way to measure how close (similar) two words are is the cosine distance: a low value ($\approx 0$) means high similarity, and a high value ($\approx 2$) means high dissimilarity.

\begin{align*}
\text{Cosine Distance} = 1 - \text{Cosine Similarity}
\end{align*}
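
Cosine similarity itself is the dot product of two vectors divided by the product of their norms. A minimal NumPy sketch (the function names are hypothetical) that shows the three reference points of the distance scale:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

print(cosine_distance([1, 0], [2, 0]))   # same direction
print(cosine_distance([1, 0], [0, 3]))   # orthogonal
print(cosine_distance([1, 0], [-1, 0]))  # opposite direction
```

The three calls illustrate distances of 0, 1 and 2 respectively. Vector length does not matter, only direction.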


In [52]:
# Finding similarity between word embeddings
from scipy.spatial.distance import cosine

word1 = 'king'
word2 = 'queen'
similarity = 1 - cosine(glove_embeddings[word1], glove_embeddings[word2])
print(f"Similarity between '{word1}' and '{word2}': {similarity:.4f}")

Similarity between 'king' and 'queen': 0.7508


In [53]:
word1 = 'bird'
word2 = 'seagull'
similarity = 1 - cosine(glove_embeddings[word1], glove_embeddings[word2])
print(f"Similarity between '{word1}' and '{word2}': {similarity:.4f}")

Similarity between 'bird' and 'seagull': 0.0718


## 3. Prediction-Based Word Embeddings
Prediction-based models are methods for learning word embeddings by predicting contextual relationships between words. These models rely on neural networks to capture the semantic and syntactic relationships between words based on their usage in a large corpus. The key idea is to learn dense, low-dimensional vectors for words by optimising a prediction task, such as:

- Predicting a word given its context.

- Predicting the context given a word.


### Word2Vec
Word2Vec is a family of neural network models that learn word embeddings by leveraging the distributional hypothesis: *words appearing in similar contexts tend to have similar meanings*. Word2Vec is designed to learn high-quality word embeddings by using two architectures:

- **Continuous Bag of Words (CBOW)**: Predicts a target word from its surrounding context words.

- **Skip-gram**: Predicts context words from a given target word.

Both models use a shallow neural network with one hidden layer, and are trained on large text corpora to optimise word vectors such that words appearing in similar contexts have similar embeddings.

#### Continuous Bag of Words (CBOW)
Continuous Bag of Words (CBOW) is a Word2Vec architecture where the model predicts a target (centre) word $w_{t}$ based on the context words $w_{t-n}, \cdots , w_{t-1}, w_{t+1}, \cdots, w_{t+n}$. For example, given the context ['the', 'cat', 'on', 'the'], the model predicts the centre word 'sat' in the phrase 'the cat sat on the mat'.

- **Hidden Layer**:
If $v_{w_{i}}$ is the embedding for context word $w_{i}$, the hidden layer output is:

\begin{align*}
h = \dfrac{1}{2n} \sum_{-n \leq j \leq n, j \neq 0} v_{w_{t+j}}
\end{align*}

- **Output Layer (Softmax)**:
The probability of predicting the centre word $w_{t}$, where $u_{w}$ is the output vector of word $w$ and $V$ is the vocabulary, is:

\begin{align*}
p(w_{t}| \text{context}) = \dfrac{\text{exp}(u^{T}_{w_{t}}h)}{\sum_{w \in V} \text{exp} (u^{T}_{w}h)}
\end{align*}

- **Objective Function**:
The objective is to maximise the log-likelihood over the corpus:

\begin{align*}
J = \dfrac{1}{T} \sum_{t=1}^{T} \text{log } p(w_{t}|w_{t-n}, \cdots , w_{t-1}, w_{t+1}, \cdots, w_{t+n})
\end{align*}
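
The three steps above can be sketched as a single forward pass in NumPy. The tiny vocabulary, the embedding dimension and the random (untrained) weights are assumptions for illustration, so the predicted distribution is close to uniform.

```python
import numpy as np

vocab = ['the', 'cat', 'sat', 'on', 'mat']
word_to_id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
dim = 4  # embedding size (assumed)
V_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input embeddings v_w
U_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # output embeddings u_w

def cbow_probs(context_words):
    """Return p(w | context) for every word w in the vocabulary."""
    ids = [word_to_id[w] for w in context_words]
    h = V_in[ids].mean(axis=0)       # hidden layer: average of context embeddings
    scores = U_out @ h               # u_w^T h for every word w
    scores = scores - scores.max()   # stabilise the softmax numerically
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = cbow_probs(['the', 'cat', 'on', 'the'])
log_likelihood = np.log(probs[word_to_id['sat']])  # one term of the objective J

print(dict(zip(vocab, probs.round(3))))
print("log p('sat' | context):", log_likelihood)
```

Training would adjust `V_in` and `U_out` to increase this log-probability, averaged over all positions in the corpus.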


#### Skip-Gram
Skip-Gram predicts context words given a centre word. For each word in the corpus, the model tries to predict the words within a window around it. Let the centre word be $w_{c}$, and context words $w_{O}$ within window size $n$:

- **Hidden Layer**:
The hidden layer output is the embedding vector $v_{w_{c}}$.

- **Output Layer (Softmax)**:
The probability of a context word $w_{O}$ given the centre word $w_{c}$ is:

\begin{align*}
p(w_{O}|w_{c}) = \dfrac{\text{exp}(u^{T}_{w_{O}}v_{w_{c}})}{\sum_{w \in V} \text{exp} (u^{T}_{w}v_{w_{c}})}
\end{align*}

- **Objective Function**:
The Skip-Gram model maximises the log probability of the context words given the centre word:

\begin{align*}
J = \dfrac{1}{T} \sum_{t=1}^{T} \sum_{-n \leq j \leq n, j \neq 0} \text{log } p(w_{t+j}|w_{t})
\end{align*}

where $T$ is the number of words in the corpus.
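
The double sum in the objective runs over (centre, context) pairs. A minimal sketch of how those training pairs could be enumerated (the helper name `skipgram_pairs` is hypothetical):

```python
def skipgram_pairs(tokens, n):
    """List every (centre, context) pair with offset j in [-n, n], j != 0."""
    pairs = []
    for t, centre in enumerate(tokens):
        for j in range(-n, n + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((centre, tokens[t + j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, n=1)

for centre, context in pairs:
    print(f"{centre} -> {context}")
```

Each pair contributes one $\text{log } p(w_{t+j}|w_{t})$ term. Positions near the start or end of the text simply have fewer context words.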


spaCy’s large (`en_core_web_lg`) and medium (`en_core_web_md`) models include 300-dimensional static word vectors pre-trained on large corpora, which can be used in the same way as Word2Vec embeddings.

In [54]:
import spacy

nlp = spacy.load("en_core_web_lg")  # Load the large English model

Similarity between words:

In [55]:
token_1 = nlp('king')[0]
token_2 = nlp('queen')[0]
print(f'Similarity between two words: {token_1.similarity(token_2):.4f}')

Similarity between two words: 0.7253


Similarity between sentences:

In [56]:
doc_1 = nlp("The cat sat on the mat.")
doc_2 = nlp("A dog rested on the rug.")
print(f'Similarity between two sentences: {doc_1.similarity(doc_2):.4f}')

Similarity between two sentences: 0.9074


### fastText
fastText is an open-source library developed by Facebook's AI Research (FAIR) team for efficiently learning word representations (embeddings) and classifying text. It operates at the subword level by breaking words into character n-grams, which lets it capture morphological information and build vectors for out-of-vocabulary (OOV) words. To keep training efficient, it supports two techniques, Hierarchical Softmax and Negative Sampling:

- **Hierarchical Softmax**:
Instead of computing the probability distribution over all possible words, hierarchical softmax organises the vocabulary into a binary tree (often a Huffman tree). Each word becomes a leaf node in this tree, so its probability only requires the binary decisions along the path from the root to that leaf.

- **Negative Sampling**:
Negative sampling updates the model with respect to only a few negative (incorrect) samples per training instance for a faster training process.
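
To illustrate the subword idea, the sketch below extracts character n-grams after wrapping the word in `<` and `>` boundary markers. The helper name and the 3-to-6 default range are assumptions for illustration, and the real library additionally hashes n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of '<word>' with n_min <= n <= n_max."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams('where', n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Related words share many n-grams, which is how subword models relate them
shared = set(char_ngrams('prince', 3, 3)) & set(char_ngrams('princess', 3, 3))
print(sorted(shared))
```

A word vector is then built from the vectors of its n-grams, so an unseen word still receives a representation from the n-grams it shares with known words.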

In [57]:
import fasttext
import tempfile

sample_data = """king queen prince princess royal throne monarch emperor empress
apple fruit banana orange peel citrus
car vehicle bike truck highway road
dog cat animal pet wolf fox
paris france berlin germany london rome madrid
computer laptop tablet smartphone device
river lake ocean sea pond bay
happy sad angry joyful excited
run walk jump swim fly
red blue green yellow purple violet"""

with tempfile.NamedTemporaryFile(mode='w') as f:
    f.write(sample_data)
    f.flush()
    model = fasttext.train_unsupervised(f.name, minCount=1, epoch=1000)

similar = model.get_nearest_neighbors('prince', k=3)
print("Most similar words to 'prince':", similar)

Read 0M words
Number of words:  62
Number of labels: 0


Most similar words to 'prince': [(0.9920476078987122, 'princess'), (0.970518171787262, 'throne'), (0.969997763633728, 'queen')]


Progress: 100.0% words/sec/thread:   95002 lr:  0.000000 avg.loss:  4.118048 ETA:   0h 0m 0s


## 4. Contextual-Based Word Embeddings
Contextual-Based Word Embeddings are representations of words in a high-dimensional space where the meaning of a word depends on its context in a sentence. Unlike traditional word embeddings, contextual embeddings capture the semantic nuances of words based on the surrounding text. For example, the word 'bank' in 'river bank' and 'bank account' will have different embeddings because their meanings differ based on context. This dynamic nature allows for a more nuanced and accurate understanding of language, which is crucial for advanced NLP tasks such as sentiment analysis, machine translation, and information extraction.

They are typically generated using deep learning models such as recurrent neural networks (RNNs), Transformers, or pre-trained language models (BERT, GPT, RoBERTa, etc.).


### Bidirectional Encoder Representations from Transformers (BERT)
Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based language model developed by Google. It is pre-trained using Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). BERT can capture contexts from both directions and understand the relationships between words.

- **Applications**: Question answering, text classification, named entity recognition, etc.

In [58]:
from transformers import BertTokenizer, BertModel
import torch

tokeniser = BertTokenizer.from_pretrained(
    'bert-base-uncased')  # Load tokeniser
model = BertModel.from_pretrained(
    'bert-base-uncased')  # Load pre-trained BERT model
model.eval()

text = "The bank will not lend money to the river bank."  # Example

inputs = tokeniser(text, return_tensors='pt')  # Tokenise and encode
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # Shape: (1, seq_len, hidden_size)

tokens = tokeniser.convert_ids_to_tokens(
    inputs['input_ids'][0])  # Map tokens to embeddings
for token, embedding in zip(tokens, embeddings[0]):
    print(f"Token: {token}\tEmbedding shape: {embedding.shape}")

Token: [CLS]	Embedding shape: torch.Size([768])
Token: the	Embedding shape: torch.Size([768])
Token: bank	Embedding shape: torch.Size([768])
Token: will	Embedding shape: torch.Size([768])
Token: not	Embedding shape: torch.Size([768])
Token: lend	Embedding shape: torch.Size([768])
Token: money	Embedding shape: torch.Size([768])
Token: to	Embedding shape: torch.Size([768])
Token: the	Embedding shape: torch.Size([768])
Token: river	Embedding shape: torch.Size([768])
Token: bank	Embedding shape: torch.Size([768])
Token: .	Embedding shape: torch.Size([768])
Token: [SEP]	Embedding shape: torch.Size([768])


### Generative Pre-trained Transformer (GPT)
Generative Pre-trained Transformer (GPT) is a unidirectional language model that predicts the next token in a sequence (left-to-right). It is trained on a large corpus of text, can easily be fine-tuned, and uses the decoder part of the Transformer architecture.

- **Applications**: Text generation, code generation, text summarisation, etc.
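
The left-to-right constraint is usually enforced with a causal attention mask. The sketch below is only an illustration of that idea with a hypothetical four-token sequence: it contrasts a lower-triangular causal mask with the all-ones mask of a bidirectional encoder.

```python
import numpy as np

tokens = ['Once', 'upon', 'a', 'time']  # toy sequence (assumed)
seq_len = len(tokens)

# Entry [i, j] is True when position i is allowed to attend to position j
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # unidirectional
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)    # sees everything

for i, token in enumerate(tokens):
    visible = [tokens[j] for j in range(seq_len) if causal_mask[i, j]]
    print(f"{token!r} attends to {visible}")
```

With the causal mask, each token sees only itself and the tokens before it, so its embedding never depends on later words.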

To get embeddings:

In [None]:
from transformers import GPT2Tokenizer, GPT2Model

tokeniser = GPT2Tokenizer.from_pretrained('gpt2')  # Load tokeniser
model = GPT2Model.from_pretrained('gpt2')  # Load pre-trained model
model.eval()

prompt = 'Once upon a time in a land far away'  # Example text as a prompt

inputs = tokeniser(prompt,
                   return_tensors='pt')  # Tokenise prompt


with torch.no_grad():  # Get embeddings
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

tokens = tokeniser.convert_ids_to_tokens(inputs['input_ids'][0])
for token, embedding in zip(tokens, embeddings[0]):
    print(f"Token: {token}\tEmbedding shape: {embedding.shape}")

To generate text:

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokeniser = GPT2Tokenizer.from_pretrained('gpt2')  # Load tokeniser
model = GPT2LMHeadModel.from_pretrained('gpt2')  # Load pre-trained model
model.eval()

prompt = 'Once upon a time in a land far away'  # Example text as a prompt

inputs = tokeniser(prompt,
                   return_tensors='pt')  # Tokenise prompt

output = model.generate(
    inputs['input_ids'],
    max_length=50,
    num_return_sequences=1,
    no_repeat_ngram_size=2,  # Prevent repeated phrases
    pad_token_id=tokeniser.eos_token_id
)

generated_text = tokeniser.decode(output[0], skip_special_tokens=True)
print(generated_text)

Once upon a time in a land far away, the sun was shining, and the moon was rising. The sun had risen, but the earth was still.

The sun rose, then, as it had rose before, to the sky,


### Embeddings from Language Models (ELMo)
Embeddings from Language Models (ELMo) is a contextual word embedding method that generates dynamic representations of words based on their context in a sentence. Unlike static embeddings (e.g., Word2Vec), ELMo uses a bidirectional LSTM architecture to capture syntactic and semantic nuances, allowing words such as 'bank' to have different embeddings in 'river bank' or 'financial bank'. It provides the following features:

- **Bidirectional Context**: Processes text in both forward and backward directions to capture context from surrounding words.
- **Layer Aggregation**: Combines embeddings from multiple LSTM layers to form rich representations.

ELMo uses a two-layer bidirectional LSTM:
- **Forward Pass**: Predicts the next word in a sequence.
- **Backward Pass**: Predicts the previous word.
- **Layer Combination**: Outputs from both LSTM layers and the initial embedding layer are combined, as a task-specific weighted sum, into a final embedding.
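
The layer aggregation step can be sketched as a softmax-weighted sum over layers, scaled by a scalar. The function name, the random stand-in layer outputs and the tensor sizes below are assumptions for illustration. In a real model the layer outputs come from the biLSTM, and the weights and scale are learned for the downstream task.

```python
import numpy as np

def elmo_combine(layer_outputs, layer_logits, gamma=1.0):
    """Collapse (num_layers, seq_len, dim) into (seq_len, dim) by a weighted sum."""
    layer_outputs = np.asarray(layer_outputs, dtype=float)
    logits = np.asarray(layer_logits, dtype=float)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()  # softmax-normalised layer weights
    return gamma * np.tensordot(weights, layer_outputs, axes=1)

rng = np.random.default_rng(0)
# Stand-in outputs: 3 layers (embedding + 2 biLSTM layers), 5 tokens, 8 dimensions
layers = rng.normal(size=(3, 5, 8))

combined = elmo_combine(layers, layer_logits=[0.0, 0.0, 0.0])
print(combined.shape)  # (5, 8)
```

With equal logits the result is simply the average of the layers. Learned logits let a task emphasise lower layers (more syntactic) or higher layers (more semantic).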