# L4a: Introduction to Natural Language Processing (NLP)
In this module, we will explore the basics of Natural Language Processing (NLP). NLP is a rapidly evolving field of artificial intelligence that focuses on the interaction between computers and humans through natural language. 

By the end of this module, you will be able to define and demonstrate mastery of the following key concepts:
* __Sentiment Analysis__ focuses on identifying and categorizing the emotional tone of textual data. It allows businesses and researchers to gauge public opinion, customer sentiment, and emotional responses in textual data.
* __Tokenization, Hashing, and Bag-of-Words (BoW)__ are fundamental techniques in NLP that transform text into numerical representations for analysis. Tokenization breaks text into smaller units, hashing creates fixed-size numerical representations of words, and BoW is a model that captures word frequency without considering word order.
* __Word Embeddings__ are dense vector representations of words that capture semantic relationships and contextual meanings. They enable machines to understand the meaning of words in relation to each other, enhancing NLP tasks such as sentiment analysis, language translation, and text classification.

We'll explore these concepts many times throughout several courses, starting with the basics of NLP and gradually building up to more advanced topics. So let's get started!
___

## Sentiment Analysis
The key use case that we will explore in this (and future) modules is __Sentiment Analysis__. Sentiment analysis focuses on identifying and categorizing the emotional tone of textual data.

What are some use cases for sentiment analysis? Here are a few examples:
* __Product reviews__: You are a business owner and you want to know how your customers feel about a new product or service. You can do surveys with numerical ratings, but what if you want to understand the nuances of _written_ customer feedback?
* __Social media__: You are a social media manager and you want to understand how your audience feels about your brand or a specific topic. You can analyze the sentiment of social media posts, comments, and mentions to gauge public opinion.
* __Good investment decisions__: You are an investor and you want to understand how the market feels about a specific company or industry. You can analyze news articles, press releases, company financial reports, and social media posts to gauge the sentiment of investors and analysts.

In all of these use cases, we want to understand (perhaps in addition to specific facts, or feedback) the _emotional tone_ of the text, i.e., is it positive, negative, or neutral? This is the essence of sentiment analysis.

__Ok, but how do we do this?__ There are many moving parts to this answer, but the first step is to convert the text into a numerical representation that we can analyze with models ranging from simple to complex. What is our pipeline for this?
___

## Text Analysis Pipeline
At the core of NLP is the need to convert raw text into numerical representations that machine-learning algorithms can digest. We’ll focus on three fundamental building blocks:

1. **Tokenization**  Segment raw text into a sequence of discrete units called __tokens__ that serve as the foundation for any downstream processing. By chopping text into words, subwords, or characters, tokenization bridges the gap between unstructured language and mathematical models.

2. **Feature Hashing**  Map tokens directly into a fixed-size vector using a hash function (often called the “hashing trick,” from Weinberger *et al.*). This produces a constant-dimensional representation without ever storing an explicit vocabulary. Collisions (two different tokens mapping to the same slot) are possible, but when the vector size is large relative to the number of distinct tokens they are rare enough to retain useful information.

3. **Bag-of-Words (BoW)**  Build an explicit vocabulary of all tokens in your corpus, then represent each document by a $|\mathcal V|$-dimensional count vector. BoW captures how often each word appears, ignoring the order or context of words, but yields highly interpretable features that work well for many tasks.

Once we have our tokens in hand, both feature hashing and BoW derive from the same raw counts; they just differ in **how** those counts get laid out in a vector space.

Let’s dive into each of these steps in detail, starting with tokenization.
___

### Vocabulary, Tokens, and Tokenization
Let $\mathcal{V}$ be the vocabulary of tokens (characters, sub-words, whole words, documents, etc.) in our [corpus](https://en.wikipedia.org/wiki/Corpus), and let $V = |\mathcal{V}|$ be the size of the vocabulary. Let $\mathbf{x}\equiv (x_1, x_2, \ldots, x_n)$, with each $x_i\in\mathcal{V}$, be a sequence of tokens, i.e., a sentence or document, where $n$ is the length of the sequence and $x_i$ is the $i$-th token in the sequence.

Let's consider a simple example: `My grandma makes the best apple pie.`

Tokens are the basic units of text that we will be working with. In this space, tokens can be characters, sub-words, whole words, or documents. Converting a sequence of text into tokens is called _tokenization_.
* _Character-level tokenization_. Given the example above, one possible choice is to let the vocabulary $\mathcal{V}$ be the (English) alphabet plus the space character and punctuation. Thus, we'd get a sequence $\mathbf{x}\in\mathcal{V}^{36}$ of length 36: `['M', 'y', ' ', ..., '.']`. Character-level tokenization tends to yield _very long sequences_.
* _Word-level tokenization_. Another possible choice is to let the vocabulary $\mathcal{V}$ be the set of all words (and punctuation marks) in the corpus. Thus, we'd get a sequence $\mathbf{x}\in\mathcal{V}^{8}$ of length 8: `['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']`. Word-level tokenization tends to yield shorter sequences; however, it requires an extensive vocabulary and cannot handle new words at test time (without a little help).
* _Sub-word tokenization_. A third possible choice is to let the vocabulary $\mathcal{V}$ be the set of commonly occurring word segments like `cious`, `ing`, `pre`. Common words like `is` are often a separate token, and single characters are also included in the vocabulary $\mathcal{V}$ to ensure all words are expressible.
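
The three tokenization choices above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the regular expression used for word splitting and the tiny set of sub-word pieces are assumptions made for this example, and the greedy longest-match rule is a toy stand-in for learned sub-word schemes.

```python
import re

text = "My grandma makes the best apple pie."

# Character-level: every character (spaces and punctuation included) is a token.
char_tokens = list(text)

# Word-level: runs of word characters are tokens; each punctuation mark is its own token.
# (The pattern below is an illustrative assumption, not a standard.)
word_tokens = re.findall(r"\w+|[^\w\s]", text)

def greedy_subwords(word, pieces):
    """Toy sub-word tokenizer: take the longest known piece, else a single character."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Single characters are always allowed, so every word is expressible.
            if word[i:j] in pieces or j == i + 1:
                out.append(word[i:j])
                i = j
                break
    return out

pieces = {"grand", "ma", "mak", "es"}  # hypothetical sub-word vocabulary

print(len(char_tokens))                    # 36
print(word_tokens)                         # ['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']
print(greedy_subwords("grandma", pieces))  # ['grand', 'ma']
print(greedy_subwords("pies", pieces))     # ['p', 'i', 'es']
```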

Given a choice of tokenization, each vocabulary element is assigned a unique index $\left\{1, 2,\dots, V\right\}$. Additionally, several special (control) tokens can be appended to the vocabulary $\mathcal{V}$, and assigned unique indices (the ordering of the special tokens is arbitrary):
* $\texttt{<bos>} \rightarrow V + 1$: the beginning-of-sequence (bos) token is used to indicate the start of a sequence.
* $\texttt{<eos>} \rightarrow V + 2$: the end-of-sequence (eos) token is used to indicate the end of a sequence.
* $\texttt{<mask>} \rightarrow V + 3$: the `mask` token that is used to mask out a token in the input sequence. This is used in training to predict the masked word.
* $\texttt{<pad>} \rightarrow V + 4$: the `pad` token is used to pad sequences to a fixed length. This is used in training to ensure that all sequences in a batch have the same length.
* $\texttt{<unk>} \rightarrow V + 5$: the out-of-vocabulary (oov) token is used to represent tokens that are not in the vocabulary. This lets the model handle unseen tokens gracefully, i.e., without crashing on text that was not used to establish the vocabulary.


A piece of text is represented as a sequence of indices (called token IDs) corresponding to its (sub)words, preceded by the $\texttt{<bos>}$ token and followed by the $\texttt{<eos>}$ token.
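
As a rough sketch of this indexing scheme, the snippet below builds a word-level vocabulary, appends the five special tokens at indices $V+1,\dots,V+5$, and encodes a sentence as token IDs. The helper names `build_vocab` and `encode`, and the choice to sort the base vocabulary alphabetically, are assumptions made for illustration.

```python
SPECIALS = ["<bos>", "<eos>", "<mask>", "<pad>", "<unk>"]

def build_vocab(corpus_tokens):
    """Assign indices 1..V to the base tokens and V+1..V+5 to the special tokens."""
    base = sorted(set(corpus_tokens))  # sorting is an arbitrary but reproducible choice
    token_to_id = {tok: i for i, tok in enumerate(base, start=1)}
    V = len(base)
    for k, special in enumerate(SPECIALS, start=1):
        token_to_id[special] = V + k
    return token_to_id

def encode(tokens, token_to_id):
    """Map tokens to IDs, send unknown tokens to <unk>, and wrap with <bos>/<eos>."""
    unk = token_to_id["<unk>"]
    ids = [token_to_id.get(tok, unk) for tok in tokens]
    return [token_to_id["<bos>"]] + ids + [token_to_id["<eos>"]]

corpus = ["My", "grandma", "makes", "the", "best", "apple", "pie", "."]
vocab = build_vocab(corpus)  # V = 8, so <bos>=9, <eos>=10, ..., <unk>=13

# "grandpa" was never seen when the vocabulary was built, so it maps to <unk>.
ids = encode(["My", "grandpa", "makes", "pie", "."], vocab)
print(ids)  # [9, 2, 13, 6, 7, 1, 10]
```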

### Feature Hashing
Feature hashing converts text into a fixed-size numerical representation __without__ the need for an explicit vocabulary $\mathcal{V}$. Let's look at a specific algorithm, the __Weinberger Feature Hashing Algorithm__, also known as the __hashing trick__. Because no vocabulary is stored, it can handle large vocabularies and unseen words gracefully.

__Initialization:__ Given an array of tokens $\mathbf{X} = \{x_1, x_2, \ldots, x_n\}$, where $x_{i}\in\mathcal{V}$, and a dimension $d$, initialize a result array $\mathbf R = \mathbf 0\in\mathbb R^d.$


For each $x\in\mathbf{X}$ __do__:
1. Compute the hash value of the current token: $h \gets\texttt{hash}(x)$.
2. Compute the index of the hash value in the result array: $i \gets h \mod d$.
3. Update the result array: $\mathbf{R}_{i} \gets \mathbf{R}_{i} + 1$.

> **Note (Weinberger sign variant):**  
> Optionally use a sign function $s(x)\in\{+1,-1\}$ (e.g., low bit of $h$) so that  
> $$\mathbf R_i \;\mathrel{+}= s(x),$$  
> which helps decorrelate hash collisions.


**Example**  
```text
Tokens: ["Hello", "world!", "This", "is", "a", "test", "."]
d = 10

Possible output:
[0, 1, 2, 0, 1, 1, 0, 1, 1, 0]
```
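
A minimal Python sketch of the hashing algorithm is shown below. It assumes a SHA-256 based hash (an illustrative choice, picked because it gives the same value on every run) and uses Python's 0-based indexing for the slots. The optional `signed` flag sketches the sign variant; taking the sign from the top bit of the 64-bit hash, rather than the low bit, is also an assumption of this example.

```python
import hashlib

def stable_hash(token):
    """Deterministic 64-bit hash of a token (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hash_features(tokens, d, signed=False):
    """Return a length-d hashed count vector for a list of tokens."""
    r = [0] * d
    for tok in tokens:
        h = stable_hash(tok)
        i = h % d  # slot index in 0..d-1
        # Sign variant: +1 or -1 from one bit of the hash; plain variant always adds +1.
        s = -1 if (signed and (h >> 63) & 1) else 1
        r[i] += s
    return r

tokens = ["Hello", "world!", "This", "is", "a", "test", "."]
r = hash_features(tokens, d=10)
r_signed = hash_features(tokens, d=10, signed=True)

print(r, sum(r))  # the entries of the unsigned vector always sum to len(tokens)
print(r_signed)
```

Which slots are non-zero depends entirely on the hash function, so the exact vector will differ from the example above.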

#### Interesting features
This implementation has several interesting features. First, the sum of the elements is equal to the number of tokens in the input text, i.e., $\sum_{i=0}^{d-1} \mathbf{R}_{i} = n$. However, the elements of the output vector are not per-token counts: element $j$ counts how many input tokens (possibly different ones) hash to index $j$ in the output vector:
$$
j \gets {\texttt{hash}(x_{i})}\;\text{mod}\;{d}
$$
Thus, the output vector is not an actual count vector, but rather a __hashed count vector__. In addition, the output vector has several other interesting features:

* __Fixed-size output__: The output vector is always of size $d$, regardless of the size of the input text. This makes it easy to use in machine learning models that require fixed-size input vectors.
* __No explicit vocabulary__: The algorithm does not require an explicit vocabulary, which means it can handle large vocabularies and unseen words gracefully, e.g., without an out-of-vocabulary (OOV) token.
* __Sparse representation__: The output vector is often sparse when $d\gg n$, meaning that most of the elements are zero. This can be efficient for storage and computation, especially when using sparse matrix representations.
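
The trade-off behind these properties can be seen in a small experiment. The sketch below (again assuming a SHA-256 based hash, which is a choice made for this example) forces a collision by hashing three distinct tokens into $d=2$ slots, then shows that a large $d$ leaves almost every slot empty.

```python
import hashlib

def bucket(token, d):
    """Slot index in 0..d-1 for a token (SHA-256 is an illustrative choice)."""
    h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
    return h % d

tokens = ["apple", "pie", "grandma"]

# Three distinct tokens cannot fit into two slots without sharing one (pigeonhole).
small_slots = [bucket(t, 2) for t in tokens]
has_collision = len(set(small_slots)) < len(tokens)

# With d much larger than n, the hashed vector is mostly zeros.
d = 1000
r = [0] * d
for t in tokens:
    r[bucket(t, d)] += 1
nonzero = sum(1 for v in r if v != 0)

print(has_collision)          # True
print(nonzero, "of", d, "slots are non-zero")
```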


Feature hashing is one text-to-numbers technique that we will explore. Let's look at another one, the __Bag-of-Words (BoW)__ model.
___

### Bag-of-Words (BoW) Model
The Bag-of-Words model converts an input token sequence into a high-dimensional count (or frequency) vector by first building an explicit vocabulary $\mathcal{V}$. It is simple, interpretable, and effective for many classic NLP tasks.

> __Model:__ Suppose you have a corpus of text documents, e.g., a collection of news articles, product reviews, or any other text data, and you want to represent each document as a vector of word counts, or perhaps word frequencies. The Bag-of-Words model does this by counting how many times each word appears in the document, _ignoring the order of words_.

First, we build the base vocabulary $\mathcal{V}$ (words, sub-words, characters, etc.) from the corpus, $\mathcal V = \{w_1, w_2, \dots, w_V\}$, where the number of elements in the vocabulary is $V = |\mathcal V|$. Optionally we add some special tokens, e.g., `<unk>` for out-of-vocabulary words, `<bos>`, `<eos>`, etc., making the full vocabulary size $N_{\mathcal V}=V+K$, where $K$ is the number of special tokens.

Let's take a look at a simple algorithm to construct a Bag-of-Words model from a sequence of tokens $\mathbf X=(x_1,\dots,x_n)\in\mathcal V^n$. The algorithm will produce a count vector $\mathbf C\in\mathbb R^{N_{\mathcal V}}$ that contains the counts of each token in the sequence $\mathbf X$ (raw counts version).

__Initialization:__ Given an array of tokens $\mathbf{X} = \{x_1, x_2, \ldots, x_n\}$, and a vocabulary size $N_{\mathcal V}$, initialize a result vector $\mathbf C = \mathbf 0\in\mathbb R^{N_{\mathcal V}}.$

For each token $x\in\mathbf X$ __do__:
1. Check if $x$ is in the vocabulary $\mathcal V$. 
    - Set $j\gets \texttt{ID}(x)$ if $x\in\mathcal V$; otherwise set $j\gets \texttt{ID}(\text{<unk>})$.
2. Increment the count:  $C_j \;\mathrel{+}= 1.$
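
The counting loop above translates almost directly into Python. The sketch below assumes the vocabulary is given as a list whose position defines each token's ID (0-based here, because Python lists are 0-indexed) and that it contains the `<unk>` special token; the function name `bag_of_words` is hypothetical.

```python
def bag_of_words(tokens, vocabulary):
    """Raw-count Bag-of-Words vector; out-of-vocabulary tokens are counted under <unk>."""
    token_to_id = {tok: j for j, tok in enumerate(vocabulary)}
    unk = token_to_id["<unk>"]
    counts = [0] * len(vocabulary)
    for x in tokens:
        j = token_to_id.get(x, unk)  # ID(x) if x is in the vocabulary, else ID(<unk>)
        counts[j] += 1
    return counts

vocabulary = ["<unk>", ".", "Hello", "a", "is", "test", "This", "world!"]

print(bag_of_words(["Hello", "world!", "This", "is", "a", "test", "."], vocabulary))
# [0, 1, 1, 1, 1, 1, 1, 1]

# "there" is not in the vocabulary, so it is counted in the <unk> slot.
print(bag_of_words(["Hello", "Hello", "there"], vocabulary))
# [1, 0, 2, 0, 0, 0, 0, 0]
```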

> **Variant (term frequency–inverse document frequency, tf-idf):**  
> After collecting raw token counts $C_j$, you can weight each entry by  
> $$\mathrm{tfidf}_j = \frac{C_j}{|\mathbf{X}|} \times \log\left(\frac{N_{\text{docs}}}{1 + \text{df}_j}\right),$$  
> where $\text{df}_j$ is the number of documents containing vocabulary token $w_j$, $N_{\text{docs}}$ is the total number of documents, and $|\mathbf{X}|$ is the length of the document (its number of tokens). This gives more weight to words that are rare across documents, reducing the impact of common words.
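
To make the tf-idf weighting concrete, here is a small sketch that applies the formula above to a toy corpus. The corpus, the three-word vocabulary, and the function name `tfidf` are all made up for illustration.

```python
import math

def tfidf(doc_tokens, corpus, vocabulary):
    """tf-idf weights for one document, using log(N_docs / (1 + df)) as the idf term."""
    n_docs = len(corpus)
    weights = []
    for w in vocabulary:
        tf = doc_tokens.count(w) / len(doc_tokens)    # C_j / |X|
        df = sum(1 for doc in corpus if w in doc)     # documents containing w
        weights.append(tf * math.log(n_docs / (1 + df)))
    return weights

corpus = [
    ["the", "pie", "is", "good"],
    ["the", "crust", "is", "flaky"],
    ["the", "apple", "pie"],
    ["grandma", "bakes"],
]
vocabulary = ["the", "pie", "apple"]

weights = tfidf(corpus[2], corpus, vocabulary)
for word, weight in zip(vocabulary, weights):
    print(word, round(weight, 3))
```

In this toy corpus, `the` appears in three of four documents and receives weight zero, while the rarer `apple` receives the largest weight. Note that with this particular smoothing, a token that appears in every document gets a slightly negative weight.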

**Example**  
```text
Tokens: ["Hello", "world!", "This", "is", "a", "test", "."]
Vocabulary: ["<unk>", ".", "Hello", "a", "is", "test", "This", "world!"]
N_V = 8 (V = 7 base tokens + K = 1 special token)

Raw counts vector C (by index):
 [0, 1, 1, 1, 1, 1, 1, 1]
```

Here, the count for `<unk>` is 0 (there are no out-of-vocabulary tokens), and every other vocabulary entry, from `"."` through `"world!"`, has a count of 1.

#### Interesting features

* __Vocabulary-driven size__ Your feature vector has one element for each token in the vocabulary. That’s helpful when your corpus remains small, but as you add more documents, that vector can grow to tens or hundreds of thousands of dimensions. Basically, every new word you come across becomes a new entry in your vocabulary.
* __Interpretability__ Since each index maps directly to a known word or special token, you can always trace an element of the vector back to its corresponding word. For example, if slot 42 corresponds to the word “quantum” in your vocabulary, a non-zero value in slot 42 tells you that “quantum” appears in the document. This is a key advantage over learned embeddings, which are less interpretable.
* __Order-agnostic counts__ BoW counts the number of times each word appears; it doesn’t consider where in the sentence it appears. That means you can capture _what_ is said without worrying about _how_ it’s said (context). For many tasks (spam detection, topic classification), that’s enough; for others (syntax parsing, complex sentiment analysis, machine translation, etc), you need more detailed features.
* __Sparsity__ If your vocabulary has 10,000 words but the document uses only 30 of them, your 10,000-dimensional vector will be almost all zeros. That sparsity can be beneficial: specialized data structures and algorithms can store and compute on these vectors very efficiently.
* __Deterministic__ With BoW, the same input and vocabulary always result in the same output. Unlike hashing, no two in-vocabulary words share a slot (only out-of-vocabulary words are pooled into `<unk>`), so you never accidentally mix signals from unrelated known tokens. Additionally, some built-in hash functions are not stable across runs (for example, Python's built-in `hash` for strings is randomized per process by default), so hashed features are only reproducible when the hash function is fixed.
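
The sparsity point can be illustrated by storing only the non-zero counts. The sketch below uses a `Counter` keyed by token ID as a stand-in for a sparse vector; the 10,000-word vocabulary with names like `w42` is entirely hypothetical.

```python
from collections import Counter

def sparse_bow(tokens, token_to_id):
    """Sparse Bag-of-Words: map token ID -> count, storing only non-zero entries."""
    unk = token_to_id["<unk>"]
    return Counter(token_to_id.get(tok, unk) for tok in tokens)

# Hypothetical vocabulary: <unk> at ID 0, then 10,000 placeholder words w0..w9999.
token_to_id = {"<unk>": 0}
for k in range(10_000):
    token_to_id[f"w{k}"] = k + 1

doc = ["w3", "w3", "w42", "zzz"]  # "zzz" is out of vocabulary
sparse = sparse_bow(doc, token_to_id)

print(len(token_to_id))  # 10001 slots in the dense vector
print(len(sparse))       # 3 entries actually stored
print(dict(sparse))      # {4: 2, 43: 1, 0: 1}
```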

While the Bag-of-Words model is simple and effective, it has some limitations, including the omission of word order and context. To address these limitations, we can use __word embeddings__, which provide a more nuanced representation of words in a continuous vector space.
___

## Classical Embedding Models
The overall goal of [embedding models](https://en.wikipedia.org/wiki/Word_embedding) is to represent tokens, e.g., characters, (sub)words, documents, etc., as vectors in a continuous vector space, where similar tokens are _close together_ in the embedding space. Let's look at some of the most popular embedding models: the continuous bag-of-words (CBOW) and skip-gram models.
* _Models?_: The CBOW and skip-gram models are based on the idea that words that appear in similar contexts tend to have similar meanings. The CBOW model predicts a target word based on its context, while the skip-gram model does the opposite: it predicts the context given a target word. For now, we will focus on the CBOW model, which is simpler and easier to understand. See: [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)

_How does CBOW work?_

1. First, the input context vector $\mathbf{x}\in\mathbb{R}^{N_{\mathcal{V}}}$ is linearly transformed into a hidden (latent) vector $\mathbf{h}\in\mathbb{R}^{h}$, where $h$ is the hidden dimension: $\mathbf{h} = \mathbf{W}_{1}\;\mathbf{x}$. Here $\mathbf{W}_{1}\in\mathbb{R}^{h\times{N_{\mathcal{V}}}}$ is an (unknown) weight matrix that we must learn, and $\mathbf{x}$ is the one-hot encoded vector of the context word; when the context contains several words, $\mathbf{x}$ is the average of their one-hot vectors (see Rong, 2014).
2. Next, the latent vector $\mathbf{h}$ is transformed again, $\mathbf{u} = \mathbf{W}_{2}\;\mathbf{h}$, which produces the score vector $\mathbf{u}\in\mathbb{R}^{N_{\mathcal{V}}}$, where $\mathbf{W}_{2}\in\mathbb{R}^{N_{\mathcal{V}}\times{h}}$ is another (unknown) weight matrix for the output transformation.
3. Finally, the output vector $\mathbf{u}$ is passed through a softmax function to obtain the probability distribution over the vocabulary:
    $$
    \begin{align*}
    p(t_{i}\mid \mathbf{x}) &= \frac{e^{(\mathbf{u})_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{(\mathbf{u})_j}} \\
    \end{align*}
    $$
    where $p(t_{i} \mid \mathbf{x})$ is the probability of observing token $t_i$, e.g., character, sub-word, word, document, etc in the vocabulary as the output (target) given the context vector $\mathbf{x}$, the term $N_{\mathcal{V}}$ is the size of the vocabulary, and $e^{(\mathbf{u})_i}$ is the exponential function applied to element $i$ of the vector $\mathbf{u}$.
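
The three steps above amount to two matrix-vector products followed by a softmax. Here is a dependency-free Python sketch of that forward pass; the sizes ($N_{\mathcal{V}}=6$, $h=3$), the random initialization, and the averaging of the context one-hot vectors (following Rong, 2014) are assumptions made for this example.

```python
import math
import random

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * v_j for w, v_j in zip(row, v)) for row in W]

def softmax(u):
    """Numerically stable softmax (subtracting max(u) does not change the result)."""
    m = max(u)
    exps = [math.exp(u_i - m) for u_i in u]
    z = sum(exps)
    return [e / z for e in exps]

def context_vector(context_ids, n_vocab):
    """Average of the one-hot vectors of the context tokens."""
    x = [0.0] * n_vocab
    for i in context_ids:
        x[i] += 1.0 / len(context_ids)
    return x

def cbow_forward(W1, W2, x):
    h = matvec(W1, x)  # hidden vector, h = W1 x
    u = matvec(W2, h)  # scores, u = W2 h
    return h, u, softmax(u)

random.seed(0)
n_vocab, n_hidden = 6, 3
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_vocab)] for _ in range(n_hidden)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_vocab)]

x = context_vector([1, 4], n_vocab)  # two context tokens with (0-based) IDs 1 and 4
h, u, p = cbow_forward(W1, W2, x)
print(len(h), len(u), round(sum(p), 6))  # 3 6 1.0
```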

### Training
The training objective of the CBOW model is to _maximize_ the likelihood of target token(s) given the context token(s). This is done by _minimizing_ the negative log-likelihood loss function (in this case, the cross-entropy between the one-hot target and the predicted distribution) over the training data.

Let $\mathbf{y}$ be the one-hot encoded vector of the target token(s), i.e., the ground truth label from the training data ($y_{k} = 1$ at the index $k$ of the target token, and $y_{i} = 0$ for all $i\neq k$), and let $\mathbf{x}$ be the input context vector. The model learns to predict the target token(s) given the context token(s) by adjusting the weights $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ to minimize the loss function $\mathcal{L}$:
$$
\begin{align*}
\min\mathcal{L} &= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\log p(t_{i} \mid \mathbf{x}) \\
&= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\log \left( \frac{e^{(\mathbf{u})_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{(\mathbf{u})_j}} \right) \\
&= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left( (\mathbf{u})_i - \log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{(\mathbf{u})_j} \right) \right) \\
&= \sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left(\log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{(\mathbf{u})_j} \right) -  (\mathbf{u})_i\right)\quad\text{substitute}~(\mathbf{u})_{i} = \langle \mathbf{w}_{2}^{(i)},\mathbf{W}_{1}\;\mathbf{x}\rangle \\
&= \sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left(\log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\langle \mathbf{w}_{2}^{(j)},\mathbf{W}_{1}\;\mathbf{x}\rangle} \right) -  \langle \mathbf{w}_{2}^{(i)},\mathbf{W}_{1}\;\mathbf{x}\rangle\right)\blacksquare\\
\end{align*}
$$
where $y_{i}$ is the $i$-th element of the one-hot encoded vector of the target token(s) and $\langle \cdot,\cdot\rangle$ is the inner product. Finally, the term $\mathbf{w}_{2}^{(i)}$ is the $i$-th row of the weight matrix $\mathbf{W}_{2}$, which corresponds to the target token $t_{i}$.
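
Because $\mathbf{y}$ is one-hot, the sum in the loss collapses to a single term: the log-sum-exp of the scores minus the score of the target token. The sketch below checks this numerically against $-\log p(t_k\mid\mathbf{x})$; the score vector and target index are made-up values.

```python
import math

def cross_entropy_from_scores(u, target):
    """Loss in the log-sum-exp form: log(sum_j exp(u_j)) - u[target]."""
    m = max(u)  # shift by max(u) for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(u_j - m) for u_j in u))
    return log_sum_exp - u[target]

u = [2.0, 0.5, -1.0, 0.0]  # hypothetical scores for a 4-token vocabulary
target = 0                 # index k of the target token (0-based)

loss = cross_entropy_from_scores(u, target)

# Direct computation: negative log of the softmax probability of the target.
p_target = math.exp(u[target]) / sum(math.exp(u_j) for u_j in u)
print(abs(loss - (-math.log(p_target))) < 1e-12)  # True
```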

A variety of optimization algorithms can be used to minimize the loss function. We'll implement the CBOW model and mess around with the inputs, hyperparameters, etc, to see how they affect its performance. Should be fun!
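
As one example of such an optimizer, the sketch below runs plain gradient descent on a single (context, target) pair. It relies on the standard result that the gradient of the loss with respect to the scores is $\mathbf{p}-\mathbf{y}$, from which the gradients for $\mathbf{W}_{2}$ and $\mathbf{W}_{1}$ follow by the chain rule. The sizes, learning rate, number of steps, and training pair are arbitrary choices for illustration; this is not a full training loop over a corpus.

```python
import math
import random

def matvec(W, v):
    return [sum(w * v_j for w, v_j in zip(row, v)) for row in W]

def loss_and_grads(W1, W2, x, target):
    """Forward pass, cross-entropy loss, and gradients for W1 and W2."""
    h = matvec(W1, x)
    u = matvec(W2, h)
    m = max(u)
    exps = [math.exp(u_i - m) for u_i in u]
    z = sum(exps)
    p = [e / z for e in exps]
    loss = -math.log(p[target])
    # Gradient of the loss with respect to the scores u: p - y.
    err = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]
    # dL/dW2 = err h^T, dL/dh = W2^T err, dL/dW1 = (dL/dh) x^T.
    dW2 = [[e_i * h_k for h_k in h] for e_i in err]
    dh = [sum(W2[i][k] * err[i] for i in range(len(err))) for k in range(len(h))]
    dW1 = [[dh_k * x_j for x_j in x] for dh_k in dh]
    return loss, dW1, dW2

def sgd_step(W, dW, lr):
    return [[w - lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, dW)]

random.seed(0)
n_vocab, n_hidden, lr = 6, 3, 0.1
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_vocab)] for _ in range(n_hidden)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_vocab)]

x = [0.0, 0.5, 0.0, 0.0, 0.5, 0.0]  # averaged one-hot context (tokens 1 and 4)
target = 2                          # 0-based index of the target token

losses = []
for _ in range(50):
    loss, dW1, dW2 = loss_and_grads(W1, W2, x, target)
    losses.append(loss)
    W1 = sgd_step(W1, dW1, lr)
    W2 = sgd_step(W2, dW2, lr)

print(losses[0], losses[-1])  # the loss should shrink over the 50 steps
```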
___