# 📘 **Text Representation | NLP Lecture 4**

### Topics: Bag of Words | TF-IDF | N-grams (Uni-grams, Bi-grams)


## I. **Introduction to Text Representation**

**Goal:** Convert text data into numerical form (a process also known as *Text Vectorisation* or *Feature Extraction from Text*) so that ML algorithms can process it.

### 🔹 Importance

* **Feature Quality:** The effectiveness of ML models depends heavily on feature quality: *“Garbage in, garbage out.”*
* **NLP Pipeline:** Follows data acquisition and pre-processing; crucial for ML-based NLP.
* **Objective:** Numerical representation should capture the *semantic meaning* of text.

### 🔹 Challenges

* Text → numbers is hard: unlike image or audio data, which are already numeric (pixel values, waveform samples), text has no inherent numerical form.
* Requires intelligent mapping of linguistic meaning to mathematical form.

### 🔹 Techniques Covered

1. One-Hot Encoding
2. Bag of Words (BoW)
3. N-grams (Uni, Bi, Tri-grams)
4. TF–IDF (Term Frequency–Inverse Document Frequency)
5. Custom Features
   *(Future: Word Embeddings – Word2Vec, GloVe, etc.)*

---

## II. **Key Terminology**

| Term                     | Description                                      |
| ------------------------ | ------------------------------------------------ |
| **Corpus (C)**           | All words from all documents (including repeats) |
| **Vocabulary (V)**       | Set of *unique* words from the corpus            |
| **Document (D)**         | A single text unit (sentence, paragraph, etc.)   |
| **Word (W) or Term (T)** | A single token in a document                     |

---

#### **1. One-Hot Encoding (1950s–1960s, very early NLP)**

* **Idea:** Represent each word as a binary vector: all 0s except for a single 1 at the position corresponding to that word in the vocabulary.
* **Example:**
  Vocabulary: ["cat", "dog", "fish"]
  “dog” → [0, **1**, 0]
#### 🔹 Disadvantages

- ❌ Sparsity: huge, high-dimensional, mostly-zero matrices.
- ❌ Non-fixed size: documents of different lengths produce inputs of different sizes.
- ❌ OOV problem: words outside the vocabulary can't be represented.
- ❌ No semantics or context: every pair of distinct words is equally far apart, so “walk” is no closer to “run” than it is to “bottle”.
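The "no semantics" drawback can be checked numerically. The sketch below is illustrative only: the three-word vocabulary and the helper names (`one_hot`, `euclidean`, `dot`) are assumptions made for this example, not part of the lecture code.

```python
import math

# Toy vocabulary chosen for illustration
vocab = ["walk", "run", "bottle"]

def one_hot(word, vocab):
    """Return a one-hot list for `word` over `vocab`."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

vectors = {w: one_hot(w, vocab) for w in vocab}

d_walk_run = euclidean(vectors["walk"], vectors["run"])
d_walk_bottle = euclidean(vectors["walk"], vectors["bottle"])

# Both distances are sqrt(2): related and unrelated words look the same
print(d_walk_run, d_walk_bottle)

# Distinct one-hot vectors are orthogonal, so their dot product is 0
print(dot(vectors["walk"], vectors["run"]))
```

Because every pair of distinct words is orthogonal and equidistant, no similarity measure on these vectors can tell related words from unrelated ones.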


In [2]:
import numpy as np
import pandas as pd  # used to display DataFrames in later cells

corpus = ["dog barks", "cat meows", "dog runs"]
# Unique vocabulary; set() ordering is arbitrary, so the order can differ between runs
words = list(set(" ".join(corpus).split()))
print("Vocabulary:", words)

# Manual one-hot encoding: row i of the V x V identity matrix encodes word i
V = len(words)
encoding = {word: np.eye(V)[i] for i, word in enumerate(words)}
encoding


Vocabulary: ['barks', 'meows', 'dog', 'cat', 'runs']


{'barks': array([1., 0., 0., 0., 0.]),
 'meows': array([0., 1., 0., 0., 0.]),
 'dog': array([0., 0., 1., 0., 0.]),
 'cat': array([0., 0., 0., 1., 0.]),
 'runs': array([0., 0., 0., 0., 1.])}

---

#### **2. Bag of Words (BoW) (1980s–1990s)**

* **Idea:** Represent a document by its word **counts** (or frequencies); word order is ignored.

- Build a vocabulary from the corpus → each document becomes a vector of word frequencies.
- Ignores order → treats text as a "bag" of words.
- Widely used for text classification.

* **Example:**
  “Dog bites man” and “Man bites dog” → same vector (same counts).
#### 🔹 Advantages

- ✅ Fixed-size vector for any document (one dimension per vocabulary word)
- ✅ Tolerates OOV words: unseen words are simply ignored at inference

#### 🔹 Disadvantages

- ❌ Sparse vectors: a large vocabulary means mostly-zero vectors
- ❌ Ignores word order and context (syntax)
- ❌ Fails with negation: “good” and “not good” get similar vectors
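The word-order and negation drawbacks can be reproduced with a few lines of plain Python. This is a minimal sketch that uses `collections.Counter` instead of scikit-learn; the helper `bow_vector` and the example sentences are assumptions made for illustration.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text` (order is discarded)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

docs = [
    "Dog bites man",
    "Man bites dog",
    "the movie was good",
    "the movie was not good",
]
# Sorted so that the column order is reproducible
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
print(vocabulary)

# Word order is lost: both sentences map to the same vector
v1 = bow_vector(docs[0], vocabulary)
v2 = bow_vector(docs[1], vocabulary)
print(v1 == v2)  # True

# Negation: the two reviews differ in a single column
pos = bow_vector(docs[2], vocabulary)
neg = bow_vector(docs[3], vocabulary)
differing = [term for term, a, b in zip(vocabulary, pos, neg) if a != b]
print(differing)  # ['not']

# OOV words ("cat" here) are silently dropped
oov = bow_vector("cat bites dog", vocabulary)
print(oov)
```

The opposite-meaning reviews differ only in the count for "not", which is why BoW features alone often struggle with negation.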


* 🔹 Scikit-learn: `CountVectorizer()`

* 🔹 Key Parameters
  - `binary=True`: presence (1/0) instead of count.
  - `max_features`: limit the vocabulary to the top-N most frequent words.
  - `stop_words`: remove common stopwords.


In [3]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the mouse"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print(pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()))



Vocabulary: ['cat' 'chased' 'dog' 'log' 'mat' 'mouse' 'on' 'sat' 'the']
   cat  chased  dog  log  mat  mouse  on  sat  the
0    1       0    0    0    1      0   1    1    2
1    0       0    1    1    0      0   1    1    2
2    1       1    0    0    0      1   0    0    2


In [None]:
# ⚙️ Key Parameters
vectorizer = CountVectorizer(
    binary=True,             # presence/absence instead of counts
    max_features=10,         # limit vocabulary size
    stop_words='english',    # remove stopwords
)

#### **3. N-grams (Unigram, Bigram, Trigram) (1990s–2000s)**

* **Idea:** Extend BoW by counting sequences of *n* consecutive words, which captures **local word order** and context.

* **Examples:** (for “dog bites man”)

  * Unigrams: “dog”, “bites”, “man”
  * Bigrams: “dog bites”, “bites man”
  * Trigrams: “dog bites man”

* **Benefit:** Adds some context awareness.

  - Captures context and short phrases
  - Helps handle negations (e.g., “not good” becomes its own feature) and idioms

* **Limitation:**

  - Vocabulary size grows combinatorially with *n* (higher computational cost)
  - Still sparse, and OOV issues remain
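To make the idea concrete, here is a small pure-Python sketch of n-gram extraction. The helper `ngrams` is a name assumed for this example; `CountVectorizer` does the equivalent internally when `ngram_range` is set.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

print(ngrams(a, 1))  # ['dog', 'bites', 'man']
print(ngrams(a, 2))  # ['dog bites', 'bites man']
print(ngrams(a, 3))  # ['dog bites man']

# Unigrams cannot tell the two sentences apart...
same_unigrams = sorted(ngrams(a, 1)) == sorted(ngrams(b, 1))
print(same_unigrams)  # True

# ...but their bigrams share nothing, so local word order is preserved
shared_bigrams = set(ngrams(a, 2)) & set(ngrams(b, 2))
print(shared_bigrams)  # set()

# A negation becomes a single feature
print(ngrams("not good".split(), 2))  # ['not good']
```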

In [4]:
# Bi-gram example
bi_vectorizer = CountVectorizer(ngram_range=(2,2))
X_bi = bi_vectorizer.fit_transform(corpus)

print("Bi-gram Vocabulary:", bi_vectorizer.get_feature_names_out())
print(pd.DataFrame(X_bi.toarray(), columns=bi_vectorizer.get_feature_names_out()))

# Uni + Bi + Tri-grams
combo_vectorizer = CountVectorizer(ngram_range=(1,3))
X_combo = combo_vectorizer.fit_transform(corpus)
print("Combined Vocabulary Size:", len(combo_vectorizer.get_feature_names_out()))

# ngram_range=(min_n, max_n) sets the lower and upper bounds on n.
# (2, 2): the minimum n is 2, so unigrams (single words) are excluded;
#         the maximum n is 2, so only bigrams (word pairs) are extracted.
# (1, 3): unigrams, bigrams, and trigrams are all included.


Bi-gram Vocabulary: ['cat chased' 'cat sat' 'chased the' 'dog sat' 'on the' 'sat on' 'the cat'
 'the dog' 'the log' 'the mat' 'the mouse']
   cat chased  cat sat  chased the  dog sat  on the  sat on  the cat  the dog  \
0           0        1           0        0       1       1        1        0   
1           0        0           0        1       1       1        0        1   
2           1        0           1        0       0       0        1        0   

   the log  the mat  the mouse  
0        0        1          0  
1        1        0          0  
2        0        0          1  
Combined Vocabulary Size: 30


In [None]:
print(X_combo)  # sparse matrix: lists each non-zero (row, column) entry with its count

---

#### **4. TF–IDF (Term Frequency–Inverse Document Frequency) (1990s–2000s)**

* **Idea:** Weight words by importance instead of raw counts: a term scores high when it is frequent in a document but rare across the corpus.
* **Formula:**

$$\text{TF-IDF}(T, D) = \text{TF}(T, D) \times \text{IDF}(T)$$

$$\text{TF}(T, D) = \frac{\text{number of occurrences of } T \text{ in } D}{\text{total number of words in } D}$$

$$\text{IDF}(T) = \log\left(\frac{N}{n_T}\right)$$

  where $N$ is the number of documents in the corpus and $n_T$ is the number of documents that contain $T$.

* ✅ Advantages

  - Reduces the weight of common words like “the” and “and”
  - Useful for information retrieval (e.g., search engines)

* ❌ Disadvantages

  - Sparse matrix
  - OOV issue
  - Still based on counts: no deep semantic relation captured
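The formula can be worked through by hand on a three-document corpus. The sketch below is illustrative: the helper names are assumptions, and the natural logarithm is assumed for `log`. It also includes the variant that scikit-learn's `TfidfVectorizer` uses by default (raw counts for TF, a smoothed IDF of `ln((1 + N) / (1 + n_T)) + 1`, then L2 normalisation of each row), which is why library output usually differs from the textbook numbers.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the mouse",
]
docs = [doc.split() for doc in corpus]
N = len(docs)  # number of documents

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the document length."""
    return Counter(doc)[term] / len(doc)

def df(term):
    """Number of documents that contain `term` (n_T in the formula)."""
    return sum(term in doc for doc in docs)

def idf(term):
    """Textbook IDF: log(N / n_T), using the natural logarithm."""
    return math.log(N / df(term))

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

# "the" occurs in every document, so its IDF (and weight) is 0
print(round(tfidf("the", docs[0]), 4))  # 0.0
# "mat" occurs in one document only, so it gets the highest weight
print(round(tfidf("mat", docs[0]), 4))  # 0.1831
print(round(tfidf("cat", docs[0]), 4))  # 0.0676

def sklearn_style_row(doc):
    """Raw counts x smoothed IDF, then L2-normalised (TfidfVectorizer defaults)."""
    counts = Counter(doc)
    weights = {t: c * (math.log((1 + N) / (1 + df(t))) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = sklearn_style_row(docs[0])
print(round(row0["cat"], 4), round(row0["the"], 4))  # 0.3742 0.5812
```

Note that under the smoothed variant "the" keeps a non-zero weight, whereas the textbook formula drives it to exactly zero.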

---

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the mouse"
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print("Vocabulary:", tfidf.get_feature_names_out())
print(pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out()))


Vocabulary: ['cat' 'chased' 'dog' 'log' 'mat' 'mouse' 'on' 'sat' 'the']
        cat    chased       dog       log       mat     mouse        on  \
0  0.374207  0.000000  0.000000  0.000000  0.492038  0.000000  0.374207   
1  0.000000  0.000000  0.468699  0.468699  0.000000  0.000000  0.356457   
2  0.381519  0.501651  0.000000  0.000000  0.000000  0.501651  0.000000   

        sat       the  
0  0.374207  0.581211  
1  0.356457  0.553642  
2  0.000000  0.592567  


#### **5. Word Embeddings (Word2Vec, GloVe, FastText, etc.) (2013 onward)**

* **Idea:** Learn **dense, low-dimensional vectors** where similar words are close in vector space.
* **Example:**
  Vector(“king”) - Vector(“man”) + Vector(“woman”) ≈ Vector(“queen”)
* **Benefit:** Captures **semantic meaning** and **relationships** between words.
* **Limitation:** Each word gets one fixed vector, regardless of context.

---

https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/tutorial.html

# 🧠 **1. Word2Vec**

### 📘 **Definition**

**Word2Vec** (by Google, 2013) is a **neural embedding model** that learns to represent words as dense vectors.
It uses two main architectures:

* **CBOW (Continuous Bag of Words):** Predicts a word based on its context.
* **Skip-Gram:** Predicts context words given a target word.

These embeddings capture **semantic and syntactic relationships** between words, e.g.,
`vector("king") - vector("man") + vector("woman") ≈ vector("queen")`.
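The difference between CBOW and Skip-Gram is easiest to see in the training examples each one builds from a sentence. The following is a simplified sketch: `skipgram_pairs` and `cbow_examples` are names assumed for illustration, and details of real Word2Vec training (such as subsampling and negative sampling) are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: one (target, context) pair per neighbour inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words jointly predict the centre word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "the cat sat on the mat".split()

pairs = skipgram_pairs(tokens, window=1)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]

examples = cbow_examples(tokens, window=1)
print(examples[1])
# (['the', 'sat'], 'cat')
```

Skip-Gram turns one centre word into several prediction tasks, while CBOW merges the whole window into a single task.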

---

### ✅ **Advantages**

* Captures both **semantic** and **syntactic** relationships.
* Trains efficiently on large datasets.
* Performs well in many NLP downstream tasks.

### ❌ **Disadvantages**

* Doesn't handle **out-of-vocabulary (OOV)** words.
* Ignores **subword (morphological)** information.
* Embeddings depend heavily on training data quality.


### 💻 **Code Example**

In [9]:
!pip install gensim



In [10]:


from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Sample corpus
sentences = [
    "I love deep learning and natural language processing",
    "Word embeddings are useful for NLP tasks",
    "Word2Vec is a great model for learning word vectors"
]

# Preprocessing
tokenized_sentences = [simple_preprocess(sentence) for sentence in sentences]

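# Train Word2Vec model (sg=1 selects Skip-Gram; sg=0 would select CBOW).
# With a corpus this small the vectors stay close to their random initialisation,
# so the similarities printed below are not meaningful.
model_w2v = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, sg=1)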

# Check vector and similar words
print(model_w2v.wv["learning"])
print(model_w2v.wv.most_similar("learning"))


[ 8.2854225e-05  3.0804786e-03 -6.8171369e-03 -1.3819924e-03
  7.6688523e-03  7.3485742e-03 -3.6703344e-03  2.6495254e-03
 -8.3272625e-03  6.2052459e-03 -4.6446035e-03 -3.1655754e-03
  9.3013998e-03  8.7241223e-04  7.4875425e-03 -6.0798707e-03
  5.1585971e-03  9.9286158e-03 -8.4559023e-03 -5.1436499e-03
 -7.0698522e-03 -4.8727528e-03 -3.7639842e-03 -8.5298754e-03
  7.9524172e-03 -4.8443628e-03  8.4252423e-03  5.2630277e-03
 -6.5567652e-03  3.9652688e-03  5.4830597e-03 -7.4314261e-03
 -7.4071982e-03 -2.4881389e-03 -8.6184945e-03 -1.5801950e-03
 -4.0058923e-04  3.3071812e-03  1.4464642e-03 -8.7997329e-04
 -5.5853585e-03  1.7180548e-03 -9.1300579e-04  6.8013240e-03
  3.9767604e-03  4.5251343e-03  1.4379798e-03 -2.6973798e-03
 -4.3649534e-03 -1.0382878e-03  1.4438222e-03 -2.6516360e-03
 -7.0641343e-03 -7.8034755e-03 -9.1298986e-03 -5.9384662e-03
 -1.8479486e-03 -4.3349918e-03 -6.4662131e-03 -3.7247417e-03
  4.2940644e-03 -3.7448257e-03  8.3856434e-03  1.5289942e-03
 -7.2446703e-03  9.43905

# 📗 **2. GloVe (Global Vectors for Word Representation)**

### 📘 **Definition**

**GloVe** (by Stanford, 2014) learns word embeddings by analyzing **global word co-occurrence statistics** across the entire corpus.
It counts how often words appear together, building a **co-occurrence matrix**, and then factorizes that matrix to learn embeddings.
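The first step, counting co-occurrences, can be sketched in plain Python. This is only the counting stage under simplifying assumptions: the helper `cooccurrence_counts` is hypothetical, every pair inside the window counts as 1, and the factorization step that produces the actual vectors is not shown.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbour) pair occurs within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

sentences = ["the cat sat on the mat", "the dog sat on the log"]
counts = cooccurrence_counts(sentences, window=1)

print(counts[("sat", "on")])   # 2  (adjacent in both sentences)
print(counts[("cat", "sat")])  # 1
print(counts[("cat", "dog")])  # 0  (never within the window)
```

These counts are accumulated over the whole corpus before any vector is learned, which is the sense in which GloVe uses global statistics.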

---

### ✅ **Advantages**

* Uses **global**, corpus-wide co-occurrence statistics, whereas Word2Vec learns from local context windows.
* Produces **consistent** embeddings using statistical information.
* Pretrained models available (trained on huge corpora like Wikipedia).

### ❌ **Disadvantages**

* Doesn't handle **OOV** words.
* Pretrained vectors are fixed: covering new words requires retraining.
* Needs large memory for the co-occurrence matrix.

---

### 💻 **Code Example**

```python
# !pip install torchtext torch

from torchtext.vocab import GloVe
import torch

# Load pretrained GloVe embeddings (50 dimensions).
# The first call downloads and caches the "6B" vectors, which is a large download.
glove = GloVe(name="6B", dim=50)

# Get vector for a word
word_vec = glove["computer"]
print(word_vec)

# Compute similarity
sim = torch.cosine_similarity(glove["king"].unsqueeze(0), glove["queen"].unsqueeze(0))
print(f"Similarity(king, queen): {sim.item():.4f}")
```

---

# 📘 **3. FastText**

### 📘 **Definition**

**FastText** (by Facebook, 2016) extends Word2Vec by representing each word as a **bag of character n-grams**.
This allows it to understand **morphology** and generate embeddings for unseen words.

Example (3-grams; FastText wraps each word in `<` and `>` boundary markers):
`"playing"` → `["<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"]`
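A short sketch of how such character n-grams can be extracted follows. The helper `char_ngrams` is a name assumed for this example. It wraps the word in `<` and `>` boundary markers as FastText does; the defaults here use only 3-grams to keep the output short, whereas the FastText library defaults to lengths 3 through 6.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

# A related word shares most of its n-grams with a known word,
# which is what lets FastText build a vector for words it never saw
shared = set(char_ngrams("learning")) & set(char_ngrams("learnings"))
print(len(shared), "shared 3-grams")
```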

---

### ✅ **Advantages**

* Handles **out-of-vocabulary** (OOV) words.
* Captures **subword information** (prefixes, suffixes).
* Works well for **morphologically rich languages** (like German or Turkish).

### ❌ **Disadvantages**

* Slightly **slower** to train than Word2Vec.
* Requires more memory.
* Subword info might not always improve performance (for small datasets).

---

### 💻 **Code Example**

```python
from gensim.models import FastText
from gensim.utils import simple_preprocess

# Corpus
sentences = [
    "I love machine learning and data science",
    "FastText creates embeddings using subwords",
    "It can handle unseen words like learnings"
]

tokenized_sentences = [simple_preprocess(s) for s in sentences]

# Train FastText model
model_ft = FastText(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Vector for a known word
print(model_ft.wv["learning"])

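# Vector for an unseen (OOV) word. "learnings" appears in the corpus above,
# so use "learner", which does not; its vector is built from character n-grams.
print("learner" in model_ft.wv.key_to_index)  # False: not in the trained vocabulary
print(model_ft.wv["learner"])  # Works!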
```






---

# 📊 **🔍 Summary Table**

| Model        | Year | Creator  | Key Idea                    | Handles OOV | Pros                              | Cons                       |
| ------------ | ---- | -------- | --------------------------- | ----------- | --------------------------------- | -------------------------- |
| **Word2Vec** | 2013 | Google   | Predictive (CBOW/Skip-Gram) | ❌          | Semantic relationships, efficient | No OOV, ignores morphology |
| **GloVe**    | 2014 | Stanford | Global co-occurrence matrix | ❌          | Global context, pretrained models | Static, no OOV             |
| **FastText** | 2016 | Facebook | Subword (character n-grams) | ✅          | Handles OOV, morphological info   | Slower, more memory        |

---





#### **6. Contextual Word Embeddings (BERT, GPT, etc.) (2018 onward)**

* **Idea:** Represent words **in context**, so “bank” in “river bank” ≠ “bank” in “money bank”.
* **Examples:** ELMo (2018), BERT (2018), GPT series (2018+)
* **Benefit:** State-of-the-art performance across NLP tasks.
* **Limitation:** Computationally expensive.


# 🧠 **Contextual Word Embeddings | NLP Lecture 5**

---

## I. **Introduction**

Traditional techniques like **One-Hot**, **BoW**, **N-grams**, and **TF–IDF** treat each word as **independent** and **context-free**.
They fail to capture:

* **Meaning differences** based on context (e.g., *bank* = river bank vs. money bank)
* **Semantic similarity** between related words (*good*, *great*, *excellent*)

➡️ **Contextual Word Embeddings** solve these issues using **deep learning**.

---

## II. **From Static to Contextual Embeddings**

### 1. **Static Embeddings (Word2Vec, GloVe, FastText)**

* Each word has **one fixed vector**, no matter where it appears.
* “bank” → same vector in both “river bank” and “bank loan”.
* Captures general *semantic similarity*, but **no context**.

### 2. **Contextual Embeddings (ELMo, BERT, GPT, etc.)**

* Word meaning changes **depending on its sentence context**.
* “bank” in “river bank” and “bank loan” → **different embeddings**.
* Generated dynamically by deep neural networks: a **BiLSTM** in ELMo, **Transformers** in BERT and GPT.

---

## III. **How Contextual Embeddings Work**

### 🔹 **Architecture**

Most modern contextual embedding models are based on the **Transformer architecture**, introduced in *“Attention Is All You Need” (Vaswani et al., 2017)*.

Key idea:
👉 Use **Self-Attention** to learn relationships between all words in a sentence.

### 🔹 **Self-Attention**

* Every word looks at all other words to understand its context.
* Example:

  * Sentence: *“The bank of the river was flooded.”*
  * The model attends to “river” → understands that *bank* refers to a geographical feature.
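A minimal NumPy sketch of single-head scaled dot-product self-attention shows the mechanics. Everything numeric here is an assumption for illustration: the token vectors and projection matrices are random rather than learned, so the attention weights carry no linguistic meaning; only the shapes and the computation pattern matter.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Similarity of every token's query with every token's key
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax (shifted by the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of all value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["the", "bank", "of", "the", "river"]
d_model, d_k = 8, 4  # toy sizes, chosen arbitrarily

X = rng.normal(size=(len(tokens), d_model))  # stand-in input embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = self_attention(X, W_q, W_k, W_v)
print(context.shape)         # (5, 4): one contextual vector per token
print(weights.shape)         # (5, 5): how much each token attends to each other token
print(weights.sum(axis=1))   # every row sums to 1
```

In a trained model, the row of `weights` for "bank" would place noticeable mass on "river", which is how the context enters the representation of "bank".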

---

## IV. **Popular Contextual Embedding Models**

| Model           | Type                  | Key Idea                              | Notes                        |
| --------------- | --------------------- | ------------------------------------- | ---------------------------- |
| **ELMo (2018)** | BiLSTM                | Contextual from both directions       | First major contextual model |
| **BERT (2018)** | Transformer (encoder) | Bidirectional context using attention | Widely used in NLP tasks     |
| **GPT (2018+)** | Transformer (decoder) | Predicts next word (left-to-right)    | Used for generation tasks    |
| **RoBERTa**     | BERT variant          | More training, better performance     |                              |
| **DistilBERT**  | Lightweight BERT      | 40% smaller, similar accuracy         |                              |

---

## V. **Example: Using BERT Embeddings**

### 💻 **Code Example (Using `transformers` library)**


🔹 Example 1: BERT (Bidirectional Encoder Representations from Transformers)


In [None]:
# !pip install transformers torch

from transformers import BertTokenizer, BertModel
import torch

# Load pretrained model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Example sentences
sentence = "I went to the bank to withdraw money."
inputs = tokenizer(sentence, return_tensors="pt")

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract last hidden states (contextual embeddings)
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)  # [batch_size, sequence_length, hidden_size]

# Example: get embedding for "bank"
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
bank_index = tokens.index("bank")
print(f"Embedding for 'bank':\n", last_hidden_states[0, bank_index, :5])  # first 5 dims


🔹 Example 2: GPT (Generative Pre-trained Transformer)

In [None]:
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

sentence = "The river bank was full of trees."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Get last hidden states
embeddings = outputs.last_hidden_state
print(embeddings.shape)

# Inspect token-level embeddings
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)

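🔹 Example 3: Compare Contexts for “bank”

In [None]:
# Reload BERT explicitly: Example 2 rebound `tokenizer` and `model` to GPT-2,
# whose tokens look like 'Ġbank', so tokens.index("bank") would raise ValueError.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = [
    "I deposited money in the bank.",
    "We sat by the river bank and watched the water."
]

bank_embeddings = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    bank_index = tokens.index("bank")
    embedding = outputs.last_hidden_state[0, bank_index, :]
    bank_embeddings.append(embedding)
    print(f"Sentence: {sentence}")
    print(f"Bank embedding (first 5 dims): {embedding[:5]}")
    print("-" * 50)

# Same word, different contexts: compare the two "bank" vectors directly
similarity = torch.cosine_similarity(bank_embeddings[0], bank_embeddings[1], dim=0)
print(f"Cosine similarity between the two 'bank' vectors: {similarity.item():.4f}")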


| Model    | Year  | Architecture        | Context Type  | Pros                         | Cons                       |
| -------- | ----- | ------------------- | ------------- | ---------------------------- | -------------------------- |
| **ELMo** | 2018  | Bi-LSTM             | Bidirectional | Context-aware, simple        | Slower, older architecture |
| **BERT** | 2018  | Transformer Encoder | Bidirectional | Best for understanding tasks | Heavy, non-generative      |
| **GPT**  | 2018+ | Transformer Decoder | Left-to-right | Generative power, flexible   | Unidirectional context     |




```python
!pip install transformers torch --quiet

from transformers import BertTokenizer, BertModel
import torch

# Load pretrained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentence
sentence = "The bank of the river was flooded."

# Tokenize input
inputs = tokenizer(sentence, return_tensors='pt')

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)

# outputs contains:
# - last_hidden_state: one contextual embedding per token
# - pooler_output: the [CLS] token's final hidden state passed through a
#   linear layer + tanh, often used as a whole-sentence representation
word_embeddings = outputs.last_hidden_state
sentence_embedding = outputs.pooler_output

print("Word Embeddings Shape:", word_embeddings.shape)
print("Sentence Embedding Shape:", sentence_embedding.shape)
```

### 🧩 **Output Explanation**

* `word_embeddings`: Tensor of shape `[1, num_tokens, 768]` → one 768-dim vector per token.
* `sentence_embedding`: Tensor of shape `[1, 768]` → condensed representation of the entire sentence.

---

## VI. **Understanding the Embedding Dimensions**

| Model           | Vector Size | Layers | Context Type         |
| --------------- | ----------- | ------ | -------------------- |
| **BERT-base**   | 768         | 12     | Bidirectional        |
| **BERT-large**  | 1024        | 24     | Bidirectional        |
| **GPT-2 small** | 768         | 12     | Unidirectional       |
| **ELMo**        | 1024        | 2      | Bidirectional (LSTM) |

Linguistic information such as syntax, semantics, and word relationships is spread across all dimensions of the vector; individual dimensions are generally not interpretable on their own.

---

## VII. **Sentence-Level Embeddings**

Instead of token-level vectors, sometimes we need **whole-sentence embeddings** (for tasks like similarity or classification).

### 🔹 Options:

1. **[CLS] Token (BERT):**
   Use the embedding of the first token, `[CLS]`, as the sentence representation.
2. **Mean Pooling:**
   Average over all token embeddings.
3. **Sentence-BERT (SBERT):**
   Fine-tuned version of BERT for **semantic similarity**.
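The first two options can be sketched with NumPy on a tiny hand-made batch. The numbers below are made up for illustration, and `mean_pool` is a hypothetical helper; with a real model, `token_embeddings` would be `last_hidden_state` and `attention_mask` would come from the tokenizer. The mask matters because padded positions should not contribute to the average.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, hidden size 2
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

# Option 1: take the first token's vector (the [CLS] position in BERT)
cls_vectors = token_embeddings[:, 0, :]
print(cls_vectors)

# Option 2: masked mean over the real tokens
pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # [[2. 3.] [4. 2.]]
```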

### 💻 Example (SBERT)

```python
!pip install sentence-transformers --quiet
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["The cat sat on the mat.", "A dog lay on the rug."]

embeddings = model.encode(sentences)

print("Embedding Shape:", embeddings.shape)
```

➡️ Output: `(2, 384)`, i.e. each sentence is a 384-dimensional dense vector.
You can then compute **cosine similarity** to measure semantic closeness.
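Cosine similarity itself takes only a few lines. The sketch below uses short hand-made vectors in place of real sentence embeddings, and `cosine_similarity` is a helper written for this example rather than a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-in vectors; real sentence embeddings would have 384 dimensions
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]    # same direction as u, different length
w = [-3.0, 0.0, 1.0]   # orthogonal to u

sim_uv = cosine_similarity(u, v)
sim_uw = cosine_similarity(u, w)
print(round(sim_uv, 6))  # 1.0
print(round(sim_uw, 6))  # 0.0
```

Because the measure depends only on direction, `u` and `v` count as identical even though their lengths differ, which is usually the desired behaviour when comparing embeddings.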

---

## VIII. **Applications of Contextual Embeddings**

| Task                               | Use                                         |
| ---------------------------------- | ------------------------------------------- |
| **Text Classification**            | Sentiment, topic, emotion detection         |
| **Named Entity Recognition (NER)** | Extracting entities like names, dates, etc. |
| **Question Answering**             | Understanding context in questions          |
| **Semantic Search**                | Finding meaning-based matches               |
| **Machine Translation**            | Context-aware language translation          |
| **Chatbots & Summarization**       | Contextual understanding of input text      |

---

## IX. **Advantages vs. Traditional Methods**

| Feature                | Traditional (TF-IDF / BoW) | Contextual (BERT / GPT)                 |
| ---------------------- | -------------------------- | --------------------------------------- |
| Word meaning           | Fixed                      | Context-aware                           |
| Vocabulary handling    | OOV issue                  | Handles unseen words via subword tokens |
| Sparsity               | High                       | Dense                                   |
| Training need          | None (handcrafted)         | Pretrained deep models                  |
| Computation            | Lightweight                | Heavy but powerful                      |
| Semantic understanding | Weak                       | Strong                                  |
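The "subword tokens" entry in the table can be illustrated with a greedy longest-match-first tokenizer in the style of BERT's WordPiece. This is a simplified sketch: the function name and the tiny vocabulary are assumptions made for illustration, whereas a real BERT vocabulary has roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"play", "##ing", "##ed", "##s", "embed", "##ding"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("embeddings", toy_vocab))  # ['embed', '##ding', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Neither "playing" nor "embeddings" is in the vocabulary as a whole word, yet both are represented through known pieces, so far fewer words end up as unknown tokens than with a word-level vocabulary.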

---

## X. **Key Takeaways**

* **Contextual embeddings** revolutionized NLP by enabling models to *understand meaning in context*.
* They are **dense, continuous, and dynamic** representations.
* Models like **BERT, GPT, RoBERTa, and SBERT** form the backbone of modern NLP systems.
* These embeddings power applications like ChatGPT, semantic search, and summarization.

---

## XI. **Practice Exercise**

1. Load a small text dataset (e.g., IMDb reviews).
2. Generate:

   * TF-IDF vectors
   * BERT sentence embeddings
3. Compare similarity between sentences using cosine similarity.
4. Visualize embeddings with **PCA or t-SNE**.

---

### 🧭 **Summary Timeline**

| Era         | Technique                         | Type            | Key Idea                   |
| ----------- | --------------------------------- | --------------- | -------------------------- |
| 1950s–1960s | One-Hot Encoding                  | Sparse          | Binary identity vectors    |
| 1980s–1990s | Bag of Words (BoW)                | Sparse          | Count-based, ignores order |
| 1990s–2000s | N-grams                           | Sparse          | Adds local context         |
| 1990s–2000s | TF–IDF                            | Weighted Sparse | Importance weighting       |
| 2013+       | Word Embeddings (Word2Vec, GloVe) | Dense           | Semantic similarity        |
| 2018+       | Contextual Embeddings (BERT, GPT) | Dense           | Meaning depends on context |

---
