# Vectorization

## Overview

This notebook covers **vectorization**, the process of converting text into numerical features that machine learning models can process. We explore tokenization (defining the smallest units of text), n-grams (unigrams, bigrams, trigrams), and how to use scikit-learn's vectorizer classes to transform text into numerical representations. Understanding sparse matrices and their importance for text data is crucial for efficient text processing in NLP pipelines.

## Objectives

- Understand why we need to convert text into numerical features
- Understand tokens (the smallest units of text processing) and n-grams (unigrams, bigrams, trigrams, etc.)
- Learn how to vectorize text using the vectorizer classes in `scikit-learn`
- Recognize sparse matrices and why they are essential for text data

## Outline

1. **Tokenization** - Defining the smallest units of text processing
2. **Collocation and N-grams** - Unigrams, bigrams, trigrams, and word sequences
3. **Bag of Words (BoW)** - Simple word counting approach
4. **TF-IDF (Term Frequency-Inverse Document Frequency)** - Weighted word importance
5. **Scikit-learn Vectorizers** - Using `CountVectorizer` and `TfidfVectorizer`
6. **Sparse Matrices** - Understanding why sparse representations are essential for text data
7. **Practical Examples** - Vectorizing real text data

In [None]:
# %pip install numpy==1.26.4 pandas==2.3.3 scikit-learn==1.8.0 --quiet

In [None]:
# Standard library imports
# (none needed for this notebook)

# Third-party imports
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Tokenization

Before dealing with text, we have to first define what a *word* is.

**Tokenization:** segment text into its smallest units of processing (atoms).

Tokens can be words, sub-words, characters, or even bytes. Whitespace such as spaces or tabs can also be tokens, which matters when processing source code. The right choice depends on how the tokens will be used downstream.
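
To make the different granularities concrete, here is a minimal sketch using only the standard library. The `tokenize` helper and its regular expression are illustrative assumptions, not the tokenizer scikit-learn uses.

```python
import re

def tokenize(text, level="word"):
    """Split text into tokens at the requested granularity (illustrative helper)."""
    if level == "word":
        # Runs of word characters/apostrophes, or single punctuation marks
        return re.findall(r"[\w']+|[^\w\s]", text)
    if level == "char":
        # Every character, including spaces
        return list(text)
    if level == "byte":
        # UTF-8 bytes as integers in the range 0-255
        return list(text.encode("utf-8"))
    raise ValueError(f"unknown level: {level!r}")

text = "Don't panic: reboot the server!"
word_tokens = tokenize(text, "word")
char_tokens = tokenize(text, "char")
byte_tokens = tokenize(text, "byte")

print(word_tokens)
print(len(word_tokens), len(char_tokens), len(byte_tokens))
```

Finer granularities produce longer sequences over a smaller set of distinct tokens, which is the trade-off behind choosing a token unit.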

## Collocation

**Unigram**: A single token considered in isolation. This is the simplest form (1-gram).

**Collocation:** a sequence of words that occur together unusually often.

Examples of bigram (2-gram) collocations:

- `"United States"`
- `"fellow citizens"`
- `"Federal Government"`
- `"General Government"`
- `"Vice President"`
- `"God bless"`
- `"White Whale"`

A bigram in Arabic could be:

- `"الذكاء الاصطناعي"` ("artificial intelligence")

A trigram in Arabic could be:

- `"ما شاء الله"` (an expression meaning "what God has willed")
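
As a rough sketch, n-grams can be extracted from a token list with a sliding window. The `ngrams` helper below is a hypothetical name introduced for illustration. Note that it returns every run of `n` consecutive tokens; deciding which of those are true collocations would additionally require frequency statistics over a corpus.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the white whale sank the ship".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(len(unigrams), len(bigrams), len(trigrams))
```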

## Sparse Representation

When we convert text documents into vectors, we create a **document-term matrix** where:
- Each row represents a document
- Each column represents a unique word (term) in the vocabulary
- Each cell contains the count or weight of that word in that document

**Why are these matrices sparse?**

In natural language, each document contains only a small fraction of all possible words. For example:
- A corpus might have 10,000 unique words (vocabulary size)
- A single document might use only 100-200 of those words
- This means 98-99% of the vector entries are zero

**Sparse matrices** are data structures that efficiently store only non-zero values, saving massive amounts of memory. Instead of storing millions of zeros, we store only the few non-zero values and their positions.

**Benefits:**
- **Memory efficiency**: Can handle corpora with millions of documents and hundreds of thousands of words
- **Computational efficiency**: Operations skip zero values, making computations faster
- **Scalability**: Enables working with large-scale text data that wouldn't fit in memory otherwise

This is why text vectorization techniques (BoW, TF-IDF) are designed to work with sparse representations.
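
The memory argument can be checked with a small experiment. The sketch below builds a synthetic document-term matrix with SciPy's `csr_matrix` (SciPy is installed as a dependency of scikit-learn); the sizes and the random word choices are assumptions made up for illustration, not real text.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n_docs, vocab_size, words_per_doc = 1_000, 10_000, 150

# Each synthetic document "uses" 150 distinct vocabulary entries once
rows = np.repeat(np.arange(n_docs), words_per_doc)
cols = np.concatenate([
    rng.choice(vocab_size, size=words_per_doc, replace=False)
    for _ in range(n_docs)
])
data = np.ones(rows.shape[0], dtype=np.int64)

X = csr_matrix((data, (rows, cols)), shape=(n_docs, vocab_size))

density = X.nnz / (n_docs * vocab_size)
# CSR stores the non-zero values, their column indices, and row pointers
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# A dense array would store every cell, zeros included
dense_bytes = n_docs * vocab_size * X.dtype.itemsize

print(f"density: {density:.2%}")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e6:.1f} MB")
```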

We will use a corpus with two distinct "topics": **Food** and **Computing**.

In [None]:
# A small corpus with distinct topics for clarity
corpus = [
    "The server is down and needs a reboot.",      # Computing
    "I love a hamburger with cheese and fries.",   # Food
    "The new code has a bug in the server.",       # Computing
    "Can I order a cheese burger with fries?",     # Food
    "Reboot the server to fix the code bug."       # Computing
]

categories = [
    "Computing",
    "Food",
    "Computing",
    "Food",
    "Computing"
]

print(f"Corpus size: {len(corpus)} documents")

Vectorization enables three main types of NLP tasks where word frequency and distribution matter more than precise meaning:

1. **Text Classification**: Assigning documents to predefined categories (e.g., spam/not spam, positive/negative sentiment)
2. **Information Retrieval**: Finding relevant documents from a collection based on a query (e.g., search engines)
3. **Topic Modeling**: Discovering hidden topics in a collection of unlabeled documents (unsupervised learning)

For these tasks, we use statistical techniques that capture word patterns:

1. **Bag of Words (BoW):** represents each document as a vector of word counts.
2. **TF-IDF (Term Frequency-Inverse Document Frequency):** represents each document as a vector of term frequencies, each weighted by how rare the term is across the corpus.

Let's explore these vectorization techniques and understand how they support these tasks.

### 1. Bag of Words (BoW)

**Concept:** The simplest method. We discard grammar and order, keeping only the **count** of each word. The document becomes a fixed-length vector of numbers.

**Key Class:** `CountVectorizer`
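
Before turning to the library class, the idea can be sketched by hand with `collections.Counter`. The two example sentences and the `bow_vector` helper are assumptions for illustration; unlike `CountVectorizer`, this sketch splits on whitespace only and removes no stop words.

```python
from collections import Counter

docs = [
    "the server needs a reboot",
    "reboot the server then reboot the router",
]

tokenized = [doc.split() for doc in docs]

# The vocabulary fixes the meaning (and order) of each vector position
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def bow_vector(tokens, vocabulary):
    """Count each vocabulary term in one document; word order is discarded."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [bow_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document maps to a vector of the same length, `len(vocabulary)`, regardless of how long the document is.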

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# 1. Initialize the Vectorizer
# stop_words='english' removes common words like 'the', 'is', 'and'
vectorizer_bow = CountVectorizer(stop_words='english')

# 2. Fit and Transform the corpus
X_bow = vectorizer_bow.fit_transform(corpus)

# --- Visualization & Analysis ---

# Get the vocabulary (the "bag")
feature_names = vectorizer_bow.get_feature_names_out()

# Convert to DataFrame for readability
df_bow = pd.DataFrame(X_bow.toarray(), columns=feature_names)

df_bow

In [None]:
# ngram_range=(1, 2) means we want both unigrams AND bigrams
vectorizer_ngram = CountVectorizer(stop_words='english', ngram_range=(1, 2))
X_ngram = vectorizer_ngram.fit_transform(corpus)

feature_names_ngram = vectorizer_ngram.get_feature_names_out()

pd.DataFrame(X_ngram.toarray(), columns=feature_names_ngram)
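
To see which features a vectorizer configured this way extracts from a single string, `build_analyzer()` can be called without fitting. This is a small sketch with an example sentence chosen for illustration. It also shows that scikit-learn's word analyzer removes stop words before forming n-grams, so a bigram can join two words that were not adjacent in the original text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as the cell above; no fitting is needed to build the analyzer
analyzer = CountVectorizer(stop_words="english", ngram_range=(1, 2)).build_analyzer()

features = analyzer("Reboot the server")
print(features)
```

Here "the" is dropped as a stop word, and the remaining tokens `reboot` and `server` are then paired into the bigram `reboot server`.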

### 2. TF-IDF (Term Frequency - Inverse Document Frequency)

In information retrieval, **TF-IDF** (term frequency–inverse document frequency) **is a measure of how important a word is to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general**.

It is often used as a weighting factor in information retrieval, text mining, and user modeling.

<img src="https://mallahyari.github.io/ml_tutorial/images/tfidf_ex3.png">

Image Source: [mallahyari](https://mallahyari.github.io/ml_tutorial/tfidf/)

Let's break it down:

#### Term Frequency (TF)

**TF** measures how common a term $t$ is in a document $d$.

`TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)`

The formula is:

$$
\text{TF}(t, d) = \frac{n(t, d)}{\sum_{t' \in d} n(t', d)}
$$

where:

* $t$ is the term
* $d$ is the document
* $n(t, d)$ is the number of times term $t$ appears in document $d$
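
The TF definition translates directly into a few lines of Python. This is a minimal sketch; the `term_frequency` helper is a hypothetical name and tokenization is plain whitespace splitting.

```python
from collections import Counter

def term_frequency(tokens):
    """TF(t, d) = n(t, d) / total number of tokens in d, for every term in d."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("reboot the server to fix the code bug".split())
print(tf)
```

Because each count is divided by the document length, the TF values of a document always sum to 1.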

#### Inverse Document Frequency (IDF)

**IDF** measures how specific a term is to certain documents: the fewer documents a term appears in, the higher its IDF. Hence the name *inverse* document frequency.

`IDF(t) = log_e(Total number of documents / Number of documents with term t in it)`

The formula is:

$$
\text{IDF}(t) = \log \left(\frac{|D|}{|\{d \in D : n(t, d) > 0\}|} \right)
$$

where:

* $D$ is the set of all documents in the corpus
* $|D|$ is the number of documents in the corpus
* $|\{d \in D : n(t, d) > 0\}|$ is the number of documents in which term $t$ appears at least once
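
A short sketch of the IDF computation, counting for each term the number of documents that contain it and using the natural logarithm. The `inverse_document_frequency` helper and the three example sentences are assumptions for illustration.

```python
import math

def inverse_document_frequency(tokenized_docs):
    """IDF(t) = ln(number of documents / number of documents containing t)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        # set() ensures a term is counted at most once per document
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = ["the server is down", "the code has a bug", "the burger has cheese"]
idf = inverse_document_frequency([doc.split() for doc in docs])

for term in ("the", "has", "server"):
    print(term, round(idf[term], 3))
```

A term found in every document gets an IDF of exactly 0, while a term found in a single document gets the maximum value, ln(3) in this three-document example.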

#### TF + IDF

**TF-IDF** combines the TF and IDF scores into a single measure of how important, and hence how representative, a term is for a document. A high score means that term `t` occurs often in document `d` (high TF) while appearing in few other documents in the corpus (high IDF):

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \cdot \text{IDF}(t)
$$

##### Example

Imagine the term $t$ appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of $t$ is calculated as follows:

$$
\text{TF} = \frac{20}{100} = 0.2
$$

Assume the corpus contains 10,000 documents, of which 100 contain the term $t$. The Inverse Document Frequency (IDF) of $t$, using a base-10 logarithm here to keep the numbers round (the base only rescales every score by the same constant factor), is:

$$
\text{IDF} = \log_{10} \frac{10000}{100} = 2
$$

Using these two quantities, the TF-IDF score of the term $t$ for the document is:

$$
\text{TF-IDF} = 0.2 \times 2 = 0.4
$$
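
A quick numerical check of this example, as a sketch with illustrative variable names. It evaluates the IDF with both a base-10 and a natural logarithm; the two results differ only by the constant factor ln 10 ≈ 2.303, so the ranking of terms is unaffected by the choice of base.

```python
import math

term_count, doc_length = 20, 100
n_docs, docs_with_term = 10_000, 100

tf = term_count / doc_length
idf_base10 = math.log10(n_docs / docs_with_term)
idf_natural = math.log(n_docs / docs_with_term)

print(f"TF = {tf}")
print(f"base 10: IDF = {idf_base10:.3f}, TF-IDF = {tf * idf_base10:.3f}")
print(f"natural: IDF = {idf_natural:.3f}, TF-IDF = {tf * idf_natural:.3f}")
```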

**Analogy:** Think of TF-IDF like a restaurant review system:
- **TF (Term Frequency)**: How many times a word appears in a review (like how many times "delicious" appears)
- **IDF (Inverse Document Frequency)**: How rare the word is across all reviews (if "delicious" appears in every review, it's not distinctive)
- **TF-IDF**: Combines both, so words that appear often in one review but rarely in others are the most informative

**Key Class:** `TfidfVectorizer`

#### Understanding TF-IDF: Compute-by-Hand Example

Let's calculate TF-IDF step-by-step for a simple example to understand how it works.

**Step 1: Define our corpus**

```python
# Simple corpus for demonstration
documents = [
    "machine learning is fun",      # Document 1
    "machine learning is hard",     # Document 2
    "python is fun"                 # Document 3
]
```

**Step 2: Calculate Term Frequency (TF)**

TF measures how common a term is in a document:

$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

Let's calculate TF for each word in each document:

| Document | Word | Count | Total Words | TF |
|----------|------|-------|-------------|-----|
| Doc 1 | machine | 1 | 4 | 1/4 = 0.25 |
| Doc 1 | learning | 1 | 4 | 1/4 = 0.25 |
| Doc 1 | is | 1 | 4 | 1/4 = 0.25 |
| Doc 1 | fun | 1 | 4 | 1/4 = 0.25 |
| Doc 2 | machine | 1 | 4 | 1/4 = 0.25 |
| Doc 2 | learning | 1 | 4 | 1/4 = 0.25 |
| Doc 2 | is | 1 | 4 | 1/4 = 0.25 |
| Doc 2 | hard | 1 | 4 | 1/4 = 0.25 |
| Doc 3 | python | 1 | 3 | 1/3 = 0.33 |
| Doc 3 | is | 1 | 3 | 1/3 = 0.33 |
| Doc 3 | fun | 1 | 3 | 1/3 = 0.33 |

**Step 3: Calculate Inverse Document Frequency (IDF)**

IDF measures how rare/common a term is across the corpus:

$$
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)
$$

Let's count how many documents contain each word:

| Word | Documents Containing It | IDF Calculation | IDF |
|------|------------------------|-----------------|-----|
| machine | 2 (Doc 1, Doc 2) | log(3/2) = log(1.5) | 0.405 |
| learning | 2 (Doc 1, Doc 2) | log(3/2) = log(1.5) | 0.405 |
| is | 3 (all documents) | log(3/3) = log(1) | 0.000 |
| fun | 2 (Doc 1, Doc 3) | log(3/2) = log(1.5) | 0.405 |
| hard | 1 (Doc 2 only) | log(3/1) = log(3) | 1.099 |
| python | 1 (Doc 3 only) | log(3/1) = log(3) | 1.099 |

**Key insight:** Words that appear in many documents (like "is") get low IDF scores. Words that appear in few documents (like "hard" or "python") get high IDF scores.

**Step 4: Calculate TF-IDF**

TF-IDF combines TF and IDF:

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

Let's calculate TF-IDF for each word in each document:

| Document | Word | TF | IDF | TF-IDF |
|----------|------|----|-----|--------|
| Doc 1 | machine | 0.25 | 0.405 | 0.25 × 0.405 = **0.101** |
| Doc 1 | learning | 0.25 | 0.405 | 0.25 × 0.405 = **0.101** |
| Doc 1 | is | 0.25 | 0.000 | 0.25 × 0.000 = **0.000** |
| Doc 1 | fun | 0.25 | 0.405 | 0.25 × 0.405 = **0.101** |
| Doc 2 | machine | 0.25 | 0.405 | 0.25 × 0.405 = **0.101** |
| Doc 2 | learning | 0.25 | 0.405 | 0.25 × 0.405 = **0.101** |
| Doc 2 | is | 0.25 | 0.000 | 0.25 × 0.000 = **0.000** |
| Doc 2 | hard | 0.25 | 1.099 | 0.25 × 1.099 = **0.275** |
| Doc 3 | python | 0.33 | 1.099 | 0.33 × 1.099 = **0.363** |
| Doc 3 | is | 0.33 | 0.000 | 0.33 × 0.000 = **0.000** |
| Doc 3 | fun | 0.33 | 0.405 | 0.33 × 0.405 = **0.134** |

**Observations:**
- "is" appears in all documents, so it gets TF-IDF = 0 (not distinctive)
- "hard" and "python" appear in only one document each, so they get high TF-IDF scores (very distinctive)
- "machine" and "learning" appear in 2 documents, so they get moderate TF-IDF scores

**Why TF-IDF works:**
- Common words (like "is", "the") appear everywhere → low IDF → low TF-IDF → down-weighted (a weight of 0 removes the word's influence entirely)
- Distinctive words (like "python", "hard") appear in few documents → high IDF → high TF-IDF → emphasized
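
The four steps above can be reproduced in a few lines of plain Python. This is a sketch using whitespace tokenization and the natural logarithm, matching the hand calculation. It keeps full precision for TF, so the Doc 3 values come out as 0.366 and 0.135 rather than the 0.363 and 0.134 obtained in the table from the rounded TF of 0.33.

```python
import math
from collections import Counter

documents = [
    "machine learning is fun",
    "machine learning is hard",
    "python is fun",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# TF-IDF per document: (count / document length) * IDF
tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    tfidf.append({
        term: (count / len(tokens)) * idf[term]
        for term, count in counts.items()
    })

for i, scores in enumerate(tfidf, start=1):
    print(f"Doc {i}:", {term: round(score, 3) for term, score in scores.items()})
```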

#### Using TF-IDF with Scikit-learn

Now let's see how to use TF-IDF in practice:

In [None]:
# Using TfidfVectorizer (similar to CountVectorizer)
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
vectorizer_tfidf = TfidfVectorizer(stop_words='english')

# Fit and transform the corpus
X_tfidf = vectorizer_tfidf.fit_transform(corpus)

# Get feature names
feature_names_tfidf = vectorizer_tfidf.get_feature_names_out()

# Convert to DataFrame for readability
df_tfidf = pd.DataFrame(
    np.round(X_tfidf.toarray(), 3),  # Round to 3 decimal places
    columns=feature_names_tfidf
)

print("TF-IDF Representation:")
print("(Higher values = more distinctive/important words)")
df_tfidf

**What to look for:**

* Look at the column `server`. It holds a non-zero weight for each of the three computing sentences and `0` for the food sentences. In the Bag of Words table, the same column holds raw counts (`1` or `0`).
* The matrix is **sparse** (mostly zeros), which is typical in NLP.

## Key Takeaways

- **Vectorization** is the process of converting text into numerical features that machine learning models can process.

- **Tokenization** defines the smallest units of text processing (atoms) - tokens can be words, sub-words, characters, or bytes depending on the use case.

- **N-grams** are sequences of n tokens:
  - **Unigrams** (1-gram): Single tokens
  - **Bigrams** (2-gram): Pairs of consecutive tokens
  - **Trigrams** (3-gram): Triplets of consecutive tokens
  - N-grams with n > 1 capture local word order and context

- **Bag of Words (BoW)** is a simple vectorization approach that counts word occurrences, ignoring word order.

- **TF-IDF (Term Frequency-Inverse Document Frequency)** weights words by their importance:
  - **TF**: How frequent a term is in a document
  - **IDF**: How rare/common a term is across the corpus
  - Terms frequent in a document but rare in the corpus get high weights

- **Scikit-learn vectorizers** (`CountVectorizer`, `TfidfVectorizer`) provide efficient, configurable text vectorization with support for:
  - N-gram extraction
  - Stop word removal
  - Vocabulary size limits
  - Custom tokenization

- **Sparse matrices** are essential for text data because:
  - Most documents contain only a small fraction of the vocabulary
  - Sparse representations save memory and computation
  - Most matrix elements are zeros

- Understanding vectorization is fundamental to building effective NLP pipelines and machine learning models for text data.