# From Statistical to Neural NLP

## Module 2 Transition: From Words to Embeddings

Welcome to **Module 2: LLM-based NLP**! Before we dive into transformers and large language models, we need to understand how we got here. This session bridges the gap between what you learned in Module 1 (Statistical NLP) and what you'll learn in Module 2 (Deep Learning NLP).

**What you'll learn in this session:** By the end of this hour, you'll understand the fundamental paradigm shifts in tokenization and vectorization that enabled modern NLP. You'll see how we evolved from simple word counting to dense semantic representations, and why this evolution was necessary for building powerful language models.

## Overview

This transition session covers three critical paradigm shifts:

1. **Tokenization Evolution**: From word-level to subword-level tokenization
2. **Vectorization Evolution**: From sparse (BoW/TF-IDF) to dense (embeddings) representations
3. **Static vs. Contextual Embeddings**: Understanding how modern models handle word meaning

These shifts are the foundation that makes transformers and LLMs possible. Understanding them will help you make informed decisions about when to use different approaches in your applications.

## Learning Objectives

By the end of this session, you will be able to:

- **Compare** word-level and subword-level tokenization approaches
- **Explain** why subword tokenization solves the out-of-vocabulary (OOV) problem
- **Understand** the conceptual shift from sparse to dense vector representations
- **Distinguish** between static and contextual embeddings
- **Recognize** when to use different tokenization and vectorization approaches
- **Connect** Module 1 concepts to Module 2 concepts

**Practical Goal**: You'll understand the "why" behind modern NLP tools, not just the "how", enabling you to make better technical decisions for your applications.

## Glossary of Terms

- **Tokenization**: The process of splitting text into smaller units (tokens)
- **Word-level tokenization**: Splitting text into complete words
- **Subword tokenization**: Splitting words into smaller units (subwords)
- **OOV (Out-of-Vocabulary)**: Words not seen during training
- **BPE (Byte Pair Encoding)**: A subword tokenization algorithm
- **WordPiece**: Another subword tokenization algorithm used by BERT
- **Sparse vector**: A high-dimensional vector with mostly zeros (e.g., BoW, TF-IDF)
- **Dense vector**: A lower-dimensional vector with mostly non-zero values (embeddings)
- **Static embedding**: A fixed vector representation for a word (same vector regardless of context)
- **Contextual embedding**: A vector representation that changes based on the word's context in a sentence
- **Distributional hypothesis**: The idea that words appearing in similar contexts have similar meanings

## Outline

1. **The Evolution of NLP Approaches** - Statistical NLP vs. Deep Learning NLP
2. **Tokenization Evolution** - From words to subwords
3. **The OOV Problem** - Why word-level tokenization fails
4. **Subword Tokenization** - How BPE and WordPiece solve OOV
5. **Vectorization Evolution** - From sparse to dense
6. **Why Dense Embeddings Matter** - Capturing semantic meaning
7. **Static vs. Contextual Embeddings** - The polysemy problem
8. **When to Use What** - Practical guidance for your applications

---

## 1. The Evolution of NLP Approaches

Let's start by understanding the big picture: how NLP evolved from statistical methods to deep learning.

![Diagram showing the evolution from statistical NLP (word clouds and TF-IDF formulas) to deep learning NLP (neural network layers and transformer architecture)](https://github.com/zeyad70/Bootcamp-week/blob/main/W5_NLP/M2/assets/nlp_evolution.png?raw=1)

### Module 1: Statistical NLP (What You Learned)

In Module 1, you learned **statistical NLP** approaches:

- **Text preprocessing**: Heavy cleaning and normalization
- **Tokenization**: Simple word-level splitting (whitespace-based)
- **Vectorization**: Sparse vectors (Bag of Words, TF-IDF)
- **Models**: Scikit-learn classifiers (logistic regression, naive bayes, etc.)
- **Philosophy**: "Super-fast librarian" - organize and retrieve based on word statistics

**Strengths**:
- Fast and efficient
- Interpretable (you can see which words matter)
- Works well for keyword-based tasks
- Low computational requirements

**Limitations**:
- Doesn't capture semantic relationships
- Struggles with context and ambiguity
- Can't handle words not in vocabulary
- Limited understanding of word order and syntax

### Module 2: Deep Learning NLP (What You'll Learn)

In Module 2, you'll learn **deep learning NLP** approaches:

- **Text preprocessing**: Minimal (mostly formatting)
- **Tokenization**: Subword-level (BPE, WordPiece)
- **Vectorization**: Dense embeddings (learned representations)
- **Models**: Transformers (BERT, GPT, etc.)
- **Philosophy**: learn meaning from context

**Strengths**:
- Captures semantic relationships
- Handles context and ambiguity
- Can process any text (subword tokenization)
- Understands word order and syntax
- Pre-trained on massive datasets

**Trade-offs**:
- Higher computational requirements
- Less interpretable
- Requires more data (or use pre-trained models)
- More complex architecture

### When to Use Each Approach?

**Use Statistical NLP (Module 1)** when:
- You need fast, interpretable results
- Your task is keyword-based (search, simple classification)
- You have limited computational resources
- You need to understand which words drive decisions

**Use Deep Learning NLP (Module 2)** when:
- You need semantic understanding
- Context matters (sentiment, translation, QA)
- You want to leverage pre-trained models
- You have access to GPUs/cloud resources
- You need state-of-the-art performance

**In practice**: Many systems use both! Statistical methods for initial filtering, deep learning for complex understanding.

---

## 2. Tokenization Evolution: From Words to Subwords

Let's dive into the first major paradigm shift: how we split text into tokens.

### Module 1: Word-Level Tokenization

In Module 1, you learned simple word-level tokenization:

```python
# Simple word-level tokenization (Module 1 approach)
text = "I love machine learning"
tokens = text.lower().split()  # ['i', 'love', 'machine', 'learning']
```

**How it works**:
- Split text by whitespace
- Optionally lowercase
- Each word becomes one token
- Simple and intuitive

**Example**:
- English: "Natural language processing" → `['natural', 'language', 'processing']`
- Arabic: "الذكاء الاصطناعي" → `['الذكاء', 'الاصطناعي']` (after proper Arabic tokenization)

#### Play around with Tokenizers

- https://tiktokenizer.vercel.app/
- https://platform.openai.com/tokenizer

```python
# Let's see word-level tokenization in action (Module 1 style)
text = "I love machine learning and natural language processing"

# Simple word-level tokenization
tokens = text.lower().split()

print("Text:", text)
print("Tokens:", tokens)
print("Number of tokens:", len(tokens))
print("\nVocabulary (unique tokens):", set(tokens))
```

```
Text: I love machine learning and natural language processing
Tokens: ['i', 'love', 'machine', 'learning', 'and', 'natural', 'language', 'processing']
Number of tokens: 8

Vocabulary (unique tokens): {'love', 'learning', 'language', 'and', 'processing', 'i', 'natural', 'machine'}
```


### The Problem: Out-of-Vocabulary (OOV) Words

Word-level tokenization has a critical limitation: **what happens when you encounter a word you've never seen?**

**Scenario**: You trained a model on a vocabulary of 10,000 words. Now you see:
- "unhappiness" (not in vocabulary)
- "pre-trained" (not in vocabulary)
- "BERT" (not in vocabulary - it's a proper noun)

**Word-level approach**: These words become `<UNK>` (unknown token) - you lose all information!

**The vocabulary explosion problem**:
- English dictionaries list roughly 170,000 words in current use
- With inflections, compounds, and new coinages, the set of possible word forms is effectively unbounded
- You can't include every possible word in your vocabulary
- New words appear constantly (slang, technical terms, proper nouns)

```python
# Vocabulary from training data
training_vocab = {
    'i', 'love', 'machine', 'learning', 'natural', 'language',
    'processing', 'happy', 'sad', 'good', 'bad'}

# New text with unseen words
new_text = "I feel unhappiness about the pre-trained model"

# Word-level tokenization
tokens = new_text.lower().split()

print("New text:", new_text)
print("\nChecking vocabulary:")
for token in tokens:
    if token in training_vocab:
        print(f"  '{token}' ✓ (in vocabulary)")
    else:
        print(f"  '{token}' ✗ (OOV - becomes <UNK>)")

print("\nConclusion: With this tiny vocabulary, every token except 'i' is OOV.")
print("Even 'unhappiness' becomes <UNK>, although 'happy' is in the vocabulary.")
```

```
New text: I feel unhappiness about the pre-trained model

Checking vocabulary:
  'i' ✓ (in vocabulary)
  'feel' ✗ (OOV - becomes <UNK>)
  'unhappiness' ✗ (OOV - becomes <UNK>)
  'about' ✗ (OOV - becomes <UNK>)
  'the' ✗ (OOV - becomes <UNK>)
  'pre-trained' ✗ (OOV - becomes <UNK>)
  'model' ✗ (OOV - becomes <UNK>)

Conclusion: With this tiny vocabulary, every token except 'i' is OOV.
Even 'unhappiness' becomes <UNK>, although 'happy' is in the vocabulary.
```


### Module 2: Subword Tokenization

**Solution**: Instead of treating words as atomic units, break them into smaller pieces (subwords)!

**Key insight**: Most words are made of smaller, reusable pieces:
- "unhappiness" = "un" + "happi" + "ness"
- "pre-trained" = "pre" + "-" + "train" + "ed"
- "BERT" = "BERT" (or could be "B" + "ER" + "T" if needed)

**Benefits**:
- Handle any word by combining subwords
- Smaller vocabulary (reuse subwords across words)
- Better for morphologically rich languages (like Arabic)
- Can represent new words without `<UNK>`

![Diagram showing subword tokenization breaking 'unhappiness' into the subwords 'un', 'happi', and 'ness'](https://github.com/zeyad70/Bootcamp-week/blob/main/W5_NLP/M2/assets/subword_solution.png?raw=1)

### How Subword Tokenization Works

**Two main algorithms**:

1. **BPE (Byte Pair Encoding)**:
   - Start with characters
   - Iteratively merge most frequent pairs

2. **WordPiece**: Similar to BPE, but each merge is chosen to maximize the likelihood of the training data rather than by raw pair frequency
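
To make the BPE merge loop concrete, here is a minimal sketch that learns merges from a toy word-frequency table. The corpus, the number of merges, and the helper names (`count_pairs`, `merge_pair`, `learn_bpe`) are illustrative assumptions, not the implementation used by any particular library.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from characters and repeatedly merge the most frequent pair."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical toy corpus: word -> frequency
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=4)
print("Learned merges:", merges)
print("Segmented words:", list(segmented))
```

Frequent character sequences such as `est` and `low` become single tokens after a few merges, while rarer material stays split into smaller pieces.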

**Example**: "unhappiness"
- Word-level: `['unhappiness']` → OOV if not in vocab
- Subword-level: `['un', '##happi', '##ness']` → all pieces are likely in the vocab (an illustrative split; the actual pieces depend on the tokenizer's learned vocabulary)

**The `##` prefix**: In WordPiece, `##` marks a piece that continues the current word rather than starting a new one.

**Arabic example**: "الذكاء" (intelligence)
- Could be split into: `['ال', '##ذكاء']` or `['الذكاء']` depending on the tokenizer
- Subwords help handle Arabic's rich morphology
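
Once a subword vocabulary exists, a word can be segmented by greedy longest-match lookup, which is the strategy WordPiece-style tokenizers use at inference time. The sketch below assumes a tiny hand-written vocabulary (`toy_vocab`) and a hypothetical helper name (`wordpiece_tokenize`); a real tokenizer ships a learned vocabulary of tens of thousands of pieces.

```python
def wordpiece_tokenize(word, vocab, unk="<UNK>"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, written by hand for illustration
toy_vocab = {"un", "##happi", "##ness", "happy", "train", "##ed", "##ing"}

for word in ["unhappiness", "trained", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

A word falls back to `<UNK>` only when no sequence of pieces covers it. Real vocabularies include single characters, which makes that fallback rare.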

```python
text = "I feel unhappiness about the pre-trained model"

# Word-level (Module 1 style)
word_tokens = text.lower().split()
print("Word-level tokenization:")
print(f"  Tokens: {word_tokens}")

# Subword-level (Module 2 style - Simulation)
# Breaking words like 'unhappiness' into meaningful sub-units
subword_tokens = ['i', 'feel', 'un', '##happi', '##ness', 'about', 'the',
                  'pre', '-', 'train', '##ed', 'model']

print("\nSubword-level tokenization:")
print(f"  Tokens: {subword_tokens}")
print(f"  Count: {len(subword_tokens)}")
print("\nKey insight: Subwords allow the model to understand 'un' (negation) and 'happi' (root) separately.")
```

```
Word-level tokenization:
  Tokens: ['i', 'feel', 'unhappiness', 'about', 'the', 'pre-trained', 'model']

Subword-level tokenization:
  Tokens: ['i', 'feel', 'un', '##happi', '##ness', 'about', 'the', 'pre', '-', 'train', '##ed', 'model']
  Count: 12

Key insight: Subwords allow the model to understand 'un' (negation) and 'happi' (root) separately.
```


### Comparison: Word-Level vs. Subword-Level

| Aspect | Word-Level (M1) | Subword-Level (M2) |
|--------|----------------|-------------------|
| **Vocabulary size** | Large and open-ended (10K-100K+ words, still incomplete) | Fixed and compact (typically 30K-50K subwords) |
| **OOV handling** | `<UNK>` token (loses info) | Breaks into subwords (preserves info) |
| **New words** | Cannot handle | Can handle by combining subwords |
| **Morphology** | Treats each form separately | Shares subwords across forms |
| **Complexity** | Simple (split by space) | More complex (learned algorithm) |
| **Use case** | Statistical NLP, keyword search | Deep learning, transformers |

**Takeaway**: Subword tokenization is essential for deep learning models because they need to handle any text, not just pre-seen words.

---

## 3. Vectorization Evolution: From Sparse to Dense

Now let's explore the second major paradigm shift: how we represent text as numbers.

### Module 1: Sparse Vectorization (BoW, TF-IDF)

In Module 1, you learned **sparse vectorization**:

**Bag of Words (BoW)**:
- Count how many times each word appears
- Create a vector with one dimension per word in vocabulary
- Most values are 0 (sparse!)

**Example**:
- Vocabulary: `['i', 'love', 'machine', 'learning', 'python']` (5 words)
- Text: "I love machine learning"
- Vector: `[1, 1, 1, 1, 0]` (one 1 for each word present, 0 for 'python')

**TF-IDF**: Similar, but weights each word by importance: words that are frequent in a document yet rare across the corpus get higher weights

**Characteristics**:
- **High-dimensional**: One dimension per word (10K-100K dimensions)
- **Sparse**: Most values are 0 (each document uses only ~100-200 words)
- **Interpretable**: You can see which words are present
- **No semantics**: "cat" and "dog" are as different as "cat" and "banana"
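
As a rough illustration of the TF-IDF weighting described above, the sketch below computes the textbook form (term frequency times `log(N / document frequency)`) by hand. The three-document corpus and the helper name `tfidf_vector` are assumptions for this example. scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "i love machine learning",
    "i love python programming",
    "machine learning is fun",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in tokenized for word in set(doc))

def tfidf_vector(doc_tokens):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc_tokens)
    return [
        counts[word] / len(doc_tokens) * math.log(n_docs / df[word])
        for word in vocab
    ]

print("Vocabulary:", vocab)
for doc, tokens in zip(docs, tokenized):
    print(doc, "->", [round(x, 3) for x in tfidf_vector(tokens)])
```

Words that appear in only one document (such as "python") receive a higher weight than words shared across documents (such as "love"), and every word absent from a document stays at zero, which is what keeps the vector sparse.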

```python
from sklearn.feature_extraction.text import CountVectorizer

# Small corpus
corpus = [
    "I love machine learning",
    "I love python programming",
    "Machine learning is fun"
]

# Create sparse BoW vectors
# Note: by default CountVectorizer lowercases text and keeps only tokens
# of two or more characters, so "I" does not appear in the vocabulary.
vectorizer = CountVectorizer()
sparse_matrix = vectorizer.fit_transform(corpus)

# Get vocabulary and display matrix
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", list(vocab))
print("\nCount matrix (sparse matrix shown as a full array):")
print(sparse_matrix.toarray())
```

```
Vocabulary: ['fun', 'is', 'learning', 'love', 'machine', 'programming', 'python']

Count matrix (sparse matrix shown as a full array):
[[0 0 1 1 1 0 0]
 [0 0 0 1 0 1 1]
 [1 1 1 0 1 0 0]]
```


### The Problem: Sparse Vectors Don't Capture Meaning

**Key limitation**: Sparse vectors treat words as independent. They don't capture relationships!

**Example**:
- "cat" → `[0, 0, 1, 0, 0, ...]` (1 in 'cat' dimension, 0 elsewhere)
- "dog" → `[0, 1, 0, 0, 0, ...]` (1 in 'dog' dimension, 0 elsewhere)
- "banana" → `[1, 0, 0, 0, 0, ...]` (1 in 'banana' dimension, 0 elsewhere)

**Distance between vectors**:
- Distance("cat", "dog") = 2 differing dimensions (Euclidean distance √2)
- Distance("cat", "banana") = 2 differing dimensions (the same distance!)

**But semantically**: "cat" and "dog" are much more similar than "cat" and "banana"!

**The problem**: Sparse vectors can't represent that "cat" and "dog" are both:
- Animals
- Pets
- Mammals
- Four-legged
- etc.
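
The equal-distance problem can be checked directly. This is a minimal sketch using a three-word vocabulary in the same order as the vectors above; the helper names (`one_hot`, `euclidean`, `dot`) are introduced here only for illustration.

```python
import math

vocab = ["banana", "dog", "cat"]

def one_hot(word):
    """Return a vector with a 1 in the word's own dimension and 0 elsewhere."""
    return [1 if entry == word else 0 for entry in vocab]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

cat, dog, banana = one_hot("cat"), one_hot("dog"), one_hot("banana")

print("cat vs dog:   ", euclidean(cat, dog), "| dot product:", dot(cat, dog))
print("cat vs banana:", euclidean(cat, banana), "| dot product:", dot(cat, banana))
```

Every pair of distinct one-hot vectors is equally far apart and has a dot product of zero, so no choice of distance measure can recover that "cat" is closer to "dog" than to "banana".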

### The Distributional Hypothesis

> "You shall know a word by the company it keeps." — J.R. Firth, 1957

Here's the key insight: **words that appear near similar words have similar meanings.**

**English examples**:
- "Dog" and "cat" both appear near: "pet", "animal", "furry", "feed"
- "Bank" (financial) appears near: "money", "deposit", "account", "loan"
- "Bank" (river) appears near: "river", "water", "shore", "fishing"

**Arabic examples**:
- "كلب" (dog) and "قطة" (cat) both appear near: "حيوان" (animal), "أليف" (pet), "طعام" (food)
- "عين" (eye) appears near: "نظر" (look), "رؤية" (vision), "وجه" (face)
- "عين" (spring) appears near: "ماء" (water), "نبع" (source), "شرب" (drink)

**The Distributional Hypothesis**: Words with similar distributions (contexts) have similar meanings.
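
One way to see the hypothesis at work is to represent each word by counts of the words that appear near it, then compare those count vectors. The sketch below is an assumption-laden toy: the six sentences, the window size, and the helper names (`cooccurrence_vectors`, `cosine`) are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-corpus
sentences = [
    "the dog is a furry pet",
    "the cat is a furry pet",
    "feed the dog every morning",
    "feed the cat every morning",
    "deposit money at the bank",
    "the bank approved the loan",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two Counters (missing keys count as 0)."""
    dot = sum(u[key] * v[key] for key in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

vectors = cooccurrence_vectors(sentences)
print("dog vs cat: ", round(cosine(vectors["dog"], vectors["cat"]), 3))
print("dog vs bank:", round(cosine(vectors["dog"], vectors["bank"]), 3))
```

In this toy corpus "dog" and "cat" share all of their contexts, so their count vectors are far more similar to each other than either is to "bank". The remaining similarity with "bank" comes from the common word "the", which is one reason real systems down-weight very frequent words.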

Let's see how we can learn word meaning from context alone. This demonstrates the core idea behind embeddings.

#### Discovering Meaning from Context

Imagine you encounter a word you've never seen. How would you figure out what it means? You'd look at the words around it—the context.

**Arabic example** ("كتاب" - kitab - book):
Suppose we see contexts like:
- "قرأت الكتاب" (I read the **book**)
- "كتب في الكتاب" (He wrote in the **book**)
- "اشترى كتاباً جديداً" (He bought a new **book**)
- "الكتاب مفيد" (The **book** is useful)

From these contexts alone, we can infer several things about the word "كتاب":

- It's something that can be **read**
- It can be **written in**
- It can be **bought/sold**
- It can be **useful or not**
- It's a **noun** (used as an object)

Let's try this with a real example from `text1` in `nltk`:

```python
# Let's see how we can learn word meaning from context
import nltk
nltk.download("book", quiet=True)
from nltk.book import text1, text2

# Displaying context for the word 'apple' in Moby Dick
print("Contexts where 'apple' appears in Text 1:")
print("=" * 70)
text1.concordance("apple", width=80, lines=5)

# Displaying context for the word 'man' in Sense and Sensibility
print("\nContexts where 'man' appears in Text 2:")
print("=" * 70)
text2.concordance("man", width=80, lines=5)
```

```
Contexts where 'apple' appears in Text 1:
Displaying 4 of 4 matches:
 an idea first born on an undigested apple - dumpling ; and since then perpetua
 the whale , is much like halving an apple ; there is no intermediate remainder
e of an anchor , or the crotch of an apple tree ), and then giving the word , h
shook , and cast his last , cindered apple to the soil . " What is it , what na

Contexts where 'man' appears in Text 2:
Displaying 5 of 121 matches:
ate owner of this estate was a single man , who lived to a very advanced age , 
 The son , a steady respectable young man , was amply provided for by the fortu
 . He was not an ill - disposed young man , unless to be rather cold hearted an
 to exist between the children of any man by different marriages ; and why was 
a gentleman - like and pleasing young man , who was introduced to their acquain
```


From just the few **English** contexts for "apple" shown above, we can infer:
- It can be **eaten** ("undigested apple")
- It can be **cut** ("halving an apple")
- It comes from a **tree** ("apple tree")
- It's a **noun** (used as an object)

> **Key insight**: The more contexts we see, the clearer the meaning becomes.

```python
from nltk.book import text2

# Show the first 10 instances of "man" with surrounding context
print("Contexts where 'man' appears:")
print("=" * 70)
text2.concordance("man", width=80, lines=10)
```

```
Contexts where 'man' appears:
Displaying 10 of 121 matches:
ate owner of this estate was a single man , who lived to a very advanced age , 
 The son , a steady respectable young man , was amply provided for by the fortu
 . He was not an ill - disposed young man , unless to be rather cold hearted an
 to exist between the children of any man by different marriages ; and why was 
a gentleman - like and pleasing young man , who was introduced to their acquain
dward Ferrars was the eldest son of a man who had died very rich ; and some mig
her established ideas of what a young man ' s address ought to be , was no long
ut yet -- he is not the kind of young man -- there is something wanting -- his 
at grace which I should expect in the man who could seriously attach my sister 
 united . I could not be happy with a man whose taste did not in every point co
```



**This is what Dense Embeddings do** (explained next): They aggregate information from millions of contexts to build rich vector representations. More training data → better embeddings → better performance in your applications.

### Module 2: Dense Embeddings

**Solution**: Learn dense, low-dimensional vectors that capture semantic meaning!

**How it works**:
1. Train a neural network on millions of texts
2. The network learns to predict words from context (or context from words)
3. The learned internal representations become word embeddings
4. Similar words end up with similar vectors
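
Step 2 can be made concrete by looking at the training examples such a network sees. In the skip-gram formulation used by Word2Vec, each word is paired with the words inside a small window around it, and the network learns to predict the context word from the center word. The sketch below only builds those pairs and trains nothing; the sentence, window size, and helper name `skipgram_pairs` are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) training pairs from a token sequence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)

for center, context in pairs:
    print(f"input: {center:>4}  ->  predict: {context}")
```

Words that occur in similar contexts produce similar sets of training pairs, which pushes their learned vectors toward each other during training.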

**Characteristics**:
- **Low-dimensional**: 100-768 dimensions (vs. 10K-100K for sparse)
- **Dense**: Most values are non-zero
- **Semantic**: Similar words have similar vectors
- **Learned**: Representations come from training, not manual design

### Example: Dense Embeddings Capture Semantic Relationships

**Sparse vectors (Module 1)**:

* "king": `[0, 0, 1, 0, ...]` (Treats "king" as a unique ID)
* "queen": `[0, 0, 0, 1, ...]` (Treats "queen" as a unique ID)
* **Result:** Mathematically, they are completely different (orthogonal). The model doesn't know they are related.

**Dense embeddings (Module 2)** (illustrative values, not taken from a trained model):

* "king": `[0.95, -0.8, 0.1, ...]` (High "royalty", Male)
* "man": `[0.05, -0.9, 0.2, ...]` (Low "royalty", Male)
* "woman": `[0.06, 0.85, 0.3, ...]` (Low "royalty", Female)
* **Result:** The vectors capture relationships. With these toy numbers, `king - man + woman` gives `[0.96, 0.95, 0.2, ...]`: high "royalty", female. In embeddings trained on real text, the word nearest to that result (excluding the three input words) is often `queen`.

**What the dimensions mean**:
Instead of just matching keywords, the model has learned concepts:

* One dimension might represent **"Gender"** (Positive for female, negative for male).
* Another might represent **"Royalty"** (High for King/Queen, low for Man/Woman).
* The relationship is encoded geometrically in the vector space.
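
The analogy arithmetic can be tried on the toy vectors above, truncated to three dimensions. The vectors for "queen" and "apple" are hypothetical values added only so that there is something to search over; none of these numbers come from a trained model.

```python
import math

embeddings = {
    # Toy values from the example above: [royalty, gender, other]
    "king":  [0.95, -0.80, 0.10],
    "man":   [0.05, -0.90, 0.20],
    "woman": [0.06,  0.85, 0.30],
    # Hypothetical values, assumed for this sketch
    "queen": [0.94,  0.88, 0.15],
    "apple": [0.02,  0.05, 0.90],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed dimension by dimension
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Search the remaining words, excluding the three inputs
candidates = {
    word: vec for word, vec in embeddings.items()
    if word not in {"king", "man", "woman"}
}
best = max(candidates, key=lambda word: cosine(target, candidates[word]))

print("king - man + woman =", [round(x, 2) for x in target])
print("Nearest candidate:", best)
```

Excluding the input words from the search matters in practice, because the result vector usually stays close to the words it was built from.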


![Words to Embedding Vectors to 3D Visual (King-Queen : Man-Woman)](https://github.com/zeyad70/Bootcamp-week/blob/main/W5_NLP/M2/assets/words_embeddings_visualized.png?raw=1)

#### Embeddings Capture Syntactic, Semantic, and other Relationships

When visualized in 2D, embeddings show clear structure:

<img src="https://miro.medium.com/v2/resize:fit:678/1*5F4TXdFYwqi-BWTToQPIfg.jpeg">

**The model learns these dimensions automatically** from data—we don't specify what each dimension means. The relationships between vectors are what matter.

### Cosine Similarity

**Cosine similarity** is the cosine of the angle between two vectors (here, word embeddings):

- It ranges from -1 to 1
- **1.0** = vectors point in the same direction (**very similar**)
- **0.0** = vectors are perpendicular (**unrelated**)
- **-1.0** = vectors point in opposite directions (**~opposite**)
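
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements that definition in plain Python and tries it on three hand-picked 2D pairs that correspond to the three cases; the function name and the example vectors are chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1, 2], [2, 4])
perpendicular = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 2], [-1, -2])

print("Same direction:", round(same_direction, 3))
print("Perpendicular: ", round(perpendicular, 3))
print("Opposite:      ", round(opposite, 3))
```

Because the lengths are divided out, `[1, 2]` and `[2, 4]` count as pointing the same way even though one is twice as long. Only direction matters.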

![](https://github.com/zeyad70/Bootcamp-week/blob/main/W5_NLP/M2/assets/cosine_similarity_3_cases.png?raw=1)

### Comparison: Sparse vs. Dense Vectors

| Aspect | Sparse (M1) | Dense (M2) |
|--------|-------------|-----------|
| **Dimensions** | 10K-100K (vocab size) | 100-768 (fixed) |
| **Density** | ~1-2% non-zero | ~100% non-zero |
| **Semantics** | No (words independent) | Yes (similar words close) |
| **Memory** | Efficient (sparse storage) | More memory per vector |
| **Interpretability** | High (see which words) | Low (dimensions abstract) |
| **Learning** | Statistical (count/weight) | Neural (learned from data) |
| **Use case** | Keyword search, simple classification | Semantic understanding, similarity |

**Takeaway**: Dense embeddings enable semantic understanding that sparse vectors cannot provide. This is why modern NLP uses embeddings!

---

## 4. Static vs. Contextual Embeddings

The final paradigm shift: understanding how word meaning changes with context.

### The Polysemy Problem

Many words have multiple meanings depending on context:

**English examples**:
- "bank": financial institution vs. river edge
- "bark": tree covering vs. dog sound
- "bat": flying mammal vs. sports equipment

**Arabic examples**:
- "عين" (ayn): eye vs. spring/water source
- "ساق" (saq): leg vs. stem (of plant)
- "رأس" (ra's): head vs. beginning/start

**The challenge**: How do we represent words that mean different things in different contexts?

### Static Embeddings (Early Deep Learning)

**Static embeddings** (Word2Vec, GloVe):
- One vector per word, regardless of context
- "bank" always has the same vector
- Learned from word co-occurrence patterns

**Limitation**:
- "I deposited money at the **bank**" → same vector as "We sat by the river **bank**"
- The model can't distinguish between meanings!

**When it works**:
- When context doesn't matter much
- For general semantic similarity
- When words have consistent meanings

```python
# Demonstrating static embeddings limitation

# Simulated static embeddings (Same vector for 'bank' regardless of meaning)
static_vector_for_bank = [0.34, -0.12, 0.88, 0.56]

context1 = "I deposited money at the bank"
context2 = "We sat by the river bank"

# In static embeddings, both get the same representation
print("Static Embeddings Analysis:")
print(f"Vector for 'bank' in Context 1: {static_vector_for_bank}")
print(f"Vector for 'bank' in Context 2: {static_vector_for_bank}")

print("\nConclusion: Static embeddings cannot distinguish between a financial bank and a river bank.")
```

```
Static Embeddings Analysis:
Vector for 'bank' in Context 1: [0.34, -0.12, 0.88, 0.56]
Vector for 'bank' in Context 2: [0.34, -0.12, 0.88, 0.56]

Conclusion: Static embeddings cannot distinguish between a financial bank and a river bank.
```


- Problem: Model can't distinguish between meanings
- Solution: Contextual embeddings (next section)

### Contextual Embeddings (Modern Transformers)

**Contextual embeddings** (BERT, GPT, modern transformers):
- Vector changes based on the word's context in the sentence
- "bank" in financial context → different vector than "bank" in river context
- Learned by processing entire sentences through transformer layers

**How it works**:
1. Tokenize the sentence
2. Pass through transformer layers (self-attention)
3. Each token gets a vector that depends on all other tokens
4. Same word, different contexts → different vectors!
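
The idea in steps 2 and 3 can be sketched with a heavily simplified form of attention: each token's new vector is a weighted average of all token vectors in the sentence, with weights from a softmax over dot products. Real transformers add learned query/key/value projections, scaling, multiple heads, and many layers, all omitted here. The 2D vectors and the helper names (`softmax`, `contextualize`) are hypothetical.

```python
import math

# Hypothetical static vectors: axis 0 ~ "finance", axis 1 ~ "nature"
static = {
    "money":   [1.0, 0.0],
    "deposit": [0.9, 0.1],
    "river":   [0.0, 1.0],
    "water":   [0.1, 0.9],
    "bank":    [0.5, 0.5],  # ambiguous on its own
}

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens, index):
    """Attention-weighted average of all token vectors, from tokens[index]'s view."""
    query = static[tokens[index]]
    scores = [sum(q * k for q, k in zip(query, static[t])) for t in tokens]
    weights = softmax(scores)
    dim = len(query)
    return [
        sum(w * static[t][d] for w, t in zip(weights, tokens))
        for d in range(dim)
    ]

finance_bank = contextualize(["deposit", "money", "bank"], index=2)
river_bank = contextualize(["river", "water", "bank"], index=2)

print("'bank' near deposit/money:", [round(x, 2) for x in finance_bank])
print("'bank' near river/water:  ", [round(x, 2) for x in river_bank])
```

The static vector for "bank" is identical in both calls, yet the output vectors differ because each one mixes in information from the surrounding tokens.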

**Advantage**:
- "I deposited money at the **bank**" → vector close to "money", "deposit", "account"
- "We sat by the river **bank**" → vector close to "river", "water", "shore"
- The model can distinguish meanings!

![Diagram showing contextual embeddings: 'bank' gets a different vector in a financial context (near money/deposit vectors) than in a river context (near river/water vectors)](https://github.com/zeyad70/Bootcamp-week/blob/main/W5_NLP/M2/assets/contextual_embeddings_1.png?raw=1)

![](https://github.com/zeyad70/Bootcamp-week/blob/main/W5_NLP/M2/assets/contextual_embeddings_2.png?raw=1)

### Comparison: Static vs. Contextual Embeddings

| Aspect | Static (Word2Vec, GloVe) | Contextual (BERT, GPT) |
|--------|-------------------------|----------------------|
| **Vector per word** | One (fixed) | Many (depends on context) |
| **Polysemy handling** | No (same vector) | Yes (different vectors) |
| **Context awareness** | No | Yes (considers sentence) |
| **Computational cost** | Low (lookup) | Higher (neural processing) |
| **Use case** | General similarity, simple tasks | Understanding, complex tasks |
| **When to use** | Fast semantic search, word similarity | Sentiment, translation, QA |

**Takeaway**: Contextual embeddings are essential for understanding words with multiple meanings. This is why transformers (BERT, GPT) are so powerful!

### The Challenge of Semantic Shift

Here's something important for real-world applications: **word meanings evolve** over time and differ across domains. Change over time is called **semantic shift** or **semantic change**.

**Why this matters**: Embeddings reflect the text they were trained on. If the training data leaves out the period or domain your problem depends on (such as medicine, law, or religion), the model may misread words that carry a different meaning there.

![Figure: How three words changed their meanings over time. Notice how their "neighbors" in semantic space shifted!](https://github.com/zeyad70/Bootcamp-week/blob/main/W5_NLP/assets/semantic_shift.png?raw=1)

*Figure: How three words changed their meanings over time. Notice how their "neighbors" in semantic space shifted!*

#### Case Study 1: The Evolution of "Broadcast" (English)

Let's trace how one word's meaning changed over time. This shows how distributional semantics captures meaning change:

**1850s: The Agricultural Era**
- **Context**: "The farmer broadcast seeds across the field"
- **Neighbors**: seed, sow, scatter, spread, field, harvest
- **Meaning**: To cast seeds broadly by hand

**1900s: The Metaphorical Shift**
- **Context**: "Newspapers broadcast news to the masses"
- **Neighbors**: news, information, circulate, distribute, media
- **Meaning**: Still connected to "scattering widely", but now about information

**1990s: The Modern Era**
- **Context**: "The BBC will broadcast the news at 6 PM"
- **Neighbors**: television, radio, network, program, channel
- **Meaning**: Firmly attached to mass media

**What happened?** The word's **distributional neighbors** changed. In 1850, "broadcast" appeared near farming words. By 1990, it appeared near media words. The meaning shifted because the context shifted.

#### Case Study 2: The Evolution of "هاتف" (Hatif - Telephone) in Arabic

Arabic also shows fascinating semantic shifts. Let's look at **"هاتف"** (hatif):

**Classical Arabic (pre-20th century)**:
- **Context**: "سمعت هاتفاً ينادي" (I heard a **voice from the unseen** calling)
- **Neighbors**: صوت (voice), غيب (unseen), نداء (call), خفي (hidden)
- **Meaning**: An invisible voice or caller (often in poetry/mystical contexts)

**Early 20th Century (Technology Introduction)**:
- **Context**: "استخدمت الهاتف للاتصال" (I used the **telephone** to call)
- **Neighbors**: اتصال (connection), مكالمة (call), سلك (wire), جهاز (device)
- **Meaning**: The new technology - telephone

**Modern Arabic (21st century)**:
- **Context**: "هاتفي الذكي" (my **smartphone**)
- **Neighbors**: تطبيق (app), إنترنت (internet), شاشة (screen), ذكي (smart)
- **Meaning**: Expanded to include all phone types, especially smartphones

**What happened?** The word's **distributional neighbors** shifted from mystical/poetic contexts to technological contexts. The meaning evolved from "invisible voice" to "telephone device" to "smartphone"—a complete semantic transformation!


#### Case Study 3: The Evolution of "شبكة" (Shabaka - Network) in Arabic

Another example: **"شبكة"** (shabaka):

**Traditional meaning**:
- **Context**: "ألقى الصياد الشبكة في البحر" (The fisherman cast the **net** into the sea)
- **Neighbors**: صيد (fishing), بحر (sea), سمك (fish), خيط (thread)
- **Meaning**: A fishing net

**Modern meaning**:
- **Context**: "اتصلت بالشبكة" (I connected to the **network**)
- **Neighbors**: إنترنت (internet), اتصال (connection), واي فاي (WiFi), بيانات (data)
- **Meaning**: Computer/internet network

**The shift**: From physical net (fishing) to abstract network (technology). The word kept its core concept of "interconnected structure" but applied it to a completely different domain.

---

## Key Takeaways

1. **Tokenization Evolution**:
   - Module 1: Word-level tokenization (simple, but fails on OOV words)
   - Module 2: Subword tokenization (handles any text by breaking words into pieces)
   - **Why**: Enables models to process any text, not just pre-seen words

2. **Vectorization Evolution**:
   - Module 1: Sparse vectors (BoW, TF-IDF) - high-dimensional, no semantics
   - Module 2: Dense embeddings - low-dimensional, captures semantic meaning
   - **Why**: Enables semantic understanding and similarity calculations

3. **Static vs. Contextual Embeddings**:
   - Static: One vector per word (fails on polysemy)
   - Contextual: Vector depends on context (handles multiple meanings)
   - **Why**: Enables understanding words with multiple meanings

4. **The Big Picture**:
   - These paradigm shifts enabled transformers and LLMs
   - Each approach has its place (use the right tool for the job)
   - Modern NLP combines multiple approaches for best results

## References

1. **Subword Tokenization**:
   - Sennrich, R., et al. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL.
   - Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL.

2. **Word Embeddings**:
   - Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space." ICLR Workshop.
   - Pennington, J., et al. (2014). "GloVe: Global Vectors for Word Representation." EMNLP.

3. **Contextual Embeddings**:
   - Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL.
   - Vaswani, A., et al. (2017). "Attention Is All You Need." NIPS.

4. **Distributional Hypothesis**:
   - Firth, J.R. (1957). "A Synopsis of Linguistic Theory, 1930-1955." In Studies in Linguistic Analysis.