```{contents}
```
# Vectorization

## **1. Bag of Words (BoW)**

* **Concept:** Represents a text as a **vector of word counts**. Ignores grammar and word order.
* **Steps:**

  1. Build a vocabulary of all unique words across the corpus.
  2. Count the frequency of each word in every document.
* **Pros:** Simple, easy to implement.
* **Cons:** Ignores context and semantics; produces sparse vectors whose length equals the vocabulary size.

**Example:**

Corpus: `["I love NLP", "NLP is amazing"]`

Vocabulary (in order of first appearance): `[I, love, NLP, is, amazing]`

BoW vectors:

* Doc1: `[1, 1, 1, 0, 0]`
* Doc2: `[0, 0, 1, 1, 1]`
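
The two steps above (build a vocabulary, then count) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, lowercasing, and a vocabulary ordered by first appearance. Any fixed ordering works, but the position of each count changes with it. The helper names `build_vocabulary` and `bag_of_words` are hypothetical.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Return the unique lowercase tokens in order of first appearance."""
    vocabulary = []
    seen = set()
    for document in corpus:
        for token in document.lower().split():
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary token occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[token] for token in vocabulary]


corpus = ["I love NLP", "NLP is amazing"]
vocabulary = build_vocabulary(corpus)
vectors = [bag_of_words(doc, vocabulary) for doc in corpus]

print(vocabulary)  # ['i', 'love', 'nlp', 'is', 'amazing']
print(vectors)     # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

Every vector has one slot per vocabulary word, which is why BoW vectors become long and mostly zero as the corpus grows.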

---

## **2. TF (Term Frequency)**

* **Concept:** Weights each word by how often it appears in a document, normalized by the document's length.
* **Formula:**

  $$
  TF(word) = \frac{\text{Number of times word appears in document}}{\text{Total number of words in document}}
  $$
* **Pros:** Captures importance of words relative to the document.
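
As a rough sketch of the formula above, the following computes term frequencies for a single document. It assumes whitespace tokenization and lowercasing. The function name `term_frequency` and the sample sentence are illustrative, not from the original text.

```python
from collections import Counter


def term_frequency(document):
    """Map each token to (occurrences in document) / (total tokens)."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


# Hypothetical sample sentence with 7 tokens; "nlp" and "is" appear twice.
tf = term_frequency("NLP is fun and NLP is useful")

print(tf["nlp"])  # 2 occurrences out of 7 tokens
print(tf["fun"])  # 1 occurrence out of 7 tokens
```

Because each count is divided by the document length, the values for one document sum to 1, which makes documents of different lengths comparable.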

---

## **3. TF-IDF (Term Frequency-Inverse Document Frequency)**

* **Concept:** Adjusts term frequency by how rare a word is across all documents. Rare words get more weight.
* **Formula:**

  $$
  TFIDF(word) = TF(word) \times \log\frac{N}{DF(word)}
  $$

  * $N$ = Total number of documents
  * $DF(word)$ = Number of documents containing the word
* **Pros:** Reduces importance of common words like "the", "is".
* **Use:** Widely used in text classification and information retrieval.
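
The formula can be sketched directly in Python. This is a minimal version under stated assumptions: whitespace tokenization, lowercasing, and the natural logarithm (the formula above does not fix a base). The function name `tf_idf` is hypothetical. Library implementations often use smoothed variants of IDF, so their numbers can differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {token: TF * log(N / DF)} dictionary per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # DF: number of documents that contain each token at least once.
    document_frequency = Counter()
    for tokens in tokenized:
        document_frequency.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            token: (count / total) * math.log(n_docs / document_frequency[token])
            for token, count in counts.items()
        })
    return weights


corpus = ["I love NLP", "NLP is amazing"]
weights = tf_idf(corpus)

# "nlp" appears in both documents, so log(2 / 2) = 0 and its weight vanishes.
print(weights[0]["nlp"])
# "love" appears in only one document, so it keeps a positive weight.
print(weights[0]["love"])
```

A word that occurs in every document gets weight zero here, which is the mechanism behind down-weighting common words.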

---

## **4. Word Embeddings**

* **Concept:** Dense vector representations capturing **semantic meaning** of words.
* **Techniques:**

  * **Word2Vec:** Learns vectors by predicting a word from its context (CBOW) or the context from a word (skip-gram).
  * **GloVe:** Learns vectors from a global word co-occurrence matrix.
  * **FastText:** Builds word vectors from subword (character n-gram) information, which helps with rare words.
* **Pros:** Captures meaning; similar words have nearby vectors.
* **Cons:** Pretrained embeddings may not always fit your corpus.
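
"Close vectors" is usually measured with cosine similarity. The sketch below illustrates the idea with hand-picked, hypothetical 3-dimensional vectors. They are not taken from any trained model, and real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, chosen by hand for illustration only.
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])

print(sim_related > sim_unrelated)  # True for these toy vectors
```

With embeddings from a trained model, the same function can be used to rank words by similarity to a query word.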

---

## **5. Contextualized Embeddings (Transformers)**

* **Concept:** Word representation depends on **context in the sentence**.
* **Models:** BERT, GPT, RoBERTa, XLNet, etc.
* **Pros:** Handles polysemy (words with multiple meanings); strong performance on many NLP tasks.
* **Cons:** Computationally expensive.

---

## **6. One-Hot Encoding**

* **Concept:** Represents each word as a **binary vector** with a 1 at the word’s index in the vocabulary and 0 everywhere else.
* **Pros:** Simple, easy to implement.
* **Cons:** Very sparse, does not capture meaning or similarity.
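
A minimal sketch of one-hot encoding, assuming a small fixed vocabulary and lowercase tokens. The function name `one_hot` and the vocabulary below are illustrative.

```python
def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1  # raises ValueError for unknown words
    return vector


# Hypothetical vocabulary; the index of each word defines its vector.
vocabulary = ["i", "love", "nlp", "is", "amazing"]
encoded = [one_hot(word, vocabulary) for word in "i love nlp".split()]

print(one_hot("nlp", vocabulary))  # [0, 0, 1, 0, 0]
print(encoded)  # [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
```

Any two distinct one-hot vectors have a dot product of zero, so the encoding carries no notion of similarity between words.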

---

**Comparison Summary**

| Technique             | Sparse/Dense | Captures Semantics | Context Awareness |
| --------------------- | ------------ | ------------------ | ----------------- |
| Bag of Words          | Sparse       | ❌                  | ❌                 |
| TF                    | Sparse       | ❌                  | ❌                 |
| TF-IDF                | Sparse       | ❌                  | ❌                 |
| Word Embeddings       | Dense        | ✅                  | ❌                 |
| Contextual Embeddings | Dense        | ✅                  | ✅                 |
| One-Hot Encoding      | Sparse       | ❌                  | ❌                 |


```{dropdown} Click here for Sections
```{tableofcontents}