# Word Embeddings
***
## Table of Contents
1. [Introduction](#1-introduction)
2. [Frequency-Based Word Embeddings](#2-frequency-based-word-embeddings)
    - [One-Hot Encoding](#one-hot-encoding)
    - [Bag of Words (BoW)](#bag-of-words-bow)
    - [Term Frequency-Inverse Document Frequency (TF-IDF)](#term-frequency-inverse-document-frequency-tf-idf)
***

In [1]:
import numpy as np

## 1. Introduction
Vectorisation in Natural Language Processing (NLP) is the process of converting text into numerical representations (vectors) so that machine learning models can process and analyse natural language data.

Word embeddings are a specific type of vectorisation technique that represents words as dense vectors in a continuous, high-dimensional space. Unlike basic vectorisation methods, word embeddings are designed to reflect semantic and syntactic relationships between words. Hence, words with similar meanings or usage contexts are mapped to vectors that are close together in the embedding space.

## 2. Frequency-Based Word Embeddings
Frequency-based (or count-based) word embeddings represent words using statistics about their occurrences and co-occurrences in a corpus. These methods do not use neural networks or prediction tasks. Instead, they rely on explicit counts and matrix manipulations to derive word vectors. Though they are typically simple, interpretable and easy to parallelise, they do not capture semantic similarity well.

### One-Hot Encoding
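Each word in the vocabulary is represented by a sparse binary vector whose length equals the vocabulary size, with a single $1$ at the position corresponding to the word and $0$ elsewhere. Because any two distinct one-hot vectors are orthogonal, this encoding does not capture semantic similarity between words.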

In [None]:
from sklearn.preprocessing import OneHotEncoder

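corpus = ['apple', 'banana', 'cherry', 'blueberry', 'apple']  # Sample words, with 'apple' repeated

# OneHotEncoder expects a 2D array of shape (n_samples, n_features)
corpus_reshaped = np.array(corpus).reshape(-1, 1)

# sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2; older versions use `sparse`)
ohe = OneHotEncoder(sparse_output=False)

# Learn the categories (sorted alphabetically) and encode each word
one_hot_encoded_corpus = ohe.fit_transform(corpus_reshaped)

print(one_hot_encoded_corpus)  # One row per input word, one column per category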

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]
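
To see why one-hot vectors carry no similarity information, the following minimal sketch computes dot products between them using only the standard library. The helper names `one_hot` and `dot` are hypothetical, and the vocabulary is written out by hand in the alphabetical order that the output above implies.

```python
# Vocabulary in alphabetical order, matching the column order of the output above.
vocabulary = ["apple", "banana", "blueberry", "cherry"]


def one_hot(word):
    """Return the one-hot vector for `word` as a list of 0s and 1s."""
    return [1 if word == entry else 0 for entry in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two different words never share a non-zero component, so their dot product is 0.
print(dot(one_hot("blueberry"), one_hot("cherry")))  # 0
# A word compared with itself gives 1.
print(dot(one_hot("apple"), one_hot("apple")))  # 1
```

Every pair of distinct words scores 0, so 'blueberry' is no closer to 'cherry' than to any other word in the vocabulary.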


### Bag of Words (BoW)
Bag of Words is a simple technique that represents a document by the frequency of each vocabulary word in it, ignoring grammar and word order. Each document becomes a vector of word counts. We can implement BoW using the `CountVectorizer` class from the scikit-learn library.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
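]  # Sample corpus of four short documents

vectoriser = CountVectorizer()  # Defaults: lowercase the text, drop punctuation and single-character tokens
X = vectoriser.fit_transform(corpus)  # Learn the vocabulary and build a sparse document-term count matrix

print(vectoriser.get_feature_names_out())  # Vocabulary, in column order
print(X.toarray())  # Dense BoW matrix: one row per document, one column per word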

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
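
The counting behind this matrix can be sketched by hand with the standard library. This is an illustration under assumptions, not scikit-learn's implementation: the hypothetical `tokenise` helper lowercases the text and keeps runs of two or more word characters, which approximates `CountVectorizer`'s default tokenisation for this corpus.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenise(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


tokenised = [tokenise(doc) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for tokens in tokenised for token in tokens})

# Each row counts how often every vocabulary word occurs in one document.
bow = []
for tokens in tokenised:
    counts = Counter(tokens)
    bow.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow:
    print(row)

# Documents 1 and 4 contain the same words in a different order,
# so their BoW vectors are identical.
print(bow[0] == bow[3])  # True
```

The final comparison makes the "ignoring word order" property concrete: a statement and a question built from the same words are indistinguishable in this representation.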


### Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency (TF-IDF) extends BoW by weighting each word by its **frequency** within a document (TF) and its **rarity** across all documents (IDF). This highlights words that are distinctive for a document while reducing the influence of words that are common everywhere, such as 'the' or 'and'. The scikit-learn library provides the `TfidfVectorizer` class to implement TF-IDF.

\begin{align*}
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
\end{align*}

where $t$ is the specific term (word) being evaluated, $d$ is a single document within the corpus, and $D$ is the set of all documents.


- **Term Frequency (TF)**: Measures how often term $t$ appears in document $d$, normalised by the document's length.

\begin{align*}
\text{TF}(t, d) = \dfrac{\text{Number of times } t \text{ appears in } d}{\text{Total terms in } d}
\end{align*}

- **Inverse Document Frequency (IDF)**: Penalises terms common across many documents.

\begin{align*}
\text{IDF}(t, D) = \text{log} \left(\dfrac{\text{Total documents in corpus} (N)}{\text{Documents containing } t} \right)
\end{align*}

Alternatively, the **smoothed IDF**, which scikit-learn uses by default (with the natural logarithm), adds one to both counts to avoid division by zero:

\begin{align*}
\text{IDF}(t, D) = \text{log} \left(\dfrac{N + 1}{\text{Documents containing } t + 1} \right) + 1
\end{align*}


*Example*: Suppose the word 'vector' appears 8 times in a 200-word document, and 50 of the 10,000 documents in the corpus contain it. Using the unsmoothed IDF with a base-10 logarithm:

\begin{align*}
\text{TF}(t, d) = \dfrac{8}{200} = 0.04
\end{align*}

\begin{align*}
\text{IDF}(t, D) = \log_{10} \left(\dfrac{10000}{50}\right) = \log_{10}(200) \approx 2.30
\end{align*}

\begin{align*}
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \approx 0.04 \times 2.30 = 0.092
\end{align*}
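
The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative, and the sketch assumes a base-10 logarithm, since that is the base that reproduces the value 2.30; the natural logarithm is printed as well for comparison.

```python
import math

term_count = 8        # occurrences of 'vector' in the document
doc_length = 200      # total words in the document
docs_with_term = 50   # documents that contain 'vector'
total_docs = 10_000   # documents in the corpus

tf = term_count / doc_length
idf = math.log10(total_docs / docs_with_term)  # unsmoothed IDF, base 10
tf_idf = tf * idf

print(f"TF     = {tf:.2f}")      # TF     = 0.04
print(f"IDF    = {idf:.2f}")     # IDF    = 2.30
print(f"TF-IDF = {tf_idf:.3f}")  # TF-IDF = 0.092

# The same ratio under the natural logarithm gives a larger IDF.
idf_natural = math.log(total_docs / docs_with_term)
print(f"IDF (natural log) = {idf_natural:.2f}")  # IDF (natural log) = 5.30
```

The choice of base only rescales every IDF value by a constant factor, so it does not change how terms rank against each other.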

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
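]  # Same sample corpus as in the BoW example

vectoriser = TfidfVectorizer()  # Defaults: smoothed IDF and L2-normalised rows
X = vectoriser.fit_transform(corpus)  # Learn the vocabulary and IDF weights, then build the sparse TF-IDF matrix

print(vectoriser.get_feature_names_out())  # Vocabulary, in column order
print(X.toarray())  # Dense TF-IDF matrix: one row per document, one column per word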

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
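
These values differ from what the plain formulas give because, with its default settings, `TfidfVectorizer` uses the raw term count as TF, applies the smoothed IDF with a natural logarithm, and then scales each row to unit Euclidean (L2) length. The sketch below reproduces those steps with the standard library only. The helper names `tokenise` and `tfidf_row` are hypothetical, and the tokenisation is an approximation of the library's default that happens to match for this corpus.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenise(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


tokenised = [tokenise(doc) for doc in corpus]
vocabulary = sorted({token for tokens in tokenised for token in tokens})
n_docs = len(corpus)

# Document frequency: the number of documents that contain each term.
doc_freq = {term: sum(term in tokens for tokens in tokenised) for term in vocabulary}

# Smoothed IDF with the natural logarithm.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}


def tfidf_row(tokens):
    """Raw count times IDF for each vocabulary term, then L2-normalised."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


tfidf = [tfidf_row(tokens) for tokens in tokenised]

# Rounded rows, for comparison with the matrix printed above.
for row in tfidf:
    print([round(w, 4) for w in row])
```

Because of the L2 normalisation, every row has length 1, which is why a word such as 'is' receives a different weight in each document even though its IDF is the same throughout.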
