# TF-IDF Vectorization (Term Frequency-Inverse Document Frequency)

**Definition:**
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document (term frequency), but is offset by how common the word is across all documents (inverse document frequency). TF-IDF is widely used in information retrieval, text mining, and natural language processing to transform text into meaningful numerical features for machine learning algorithms.

- **TF (Term Frequency):** Measures how frequently a term appears in a document.
- **IDF (Inverse Document Frequency):** Measures how unique or rare a term is across all documents in the corpus.
- **TF-IDF:** The product of TF and IDF, giving higher scores to terms that are frequent in a document but rare in the corpus.

**Key idea:**
- Common words (like "the", "is") get low scores.
- Words that are frequent in a document but rare in the corpus get high scores.

---

## 1️⃣ Formulas
- **Term Frequency (TF):**
  - Measures how frequently a term appears in a document.
  - $TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total terms in document } d}$

- **Inverse Document Frequency (IDF):**
  - Measures how rare a term is across the whole corpus; rarer terms score higher.
  - $IDF(t) = \log_{10}\left(\frac{N}{DF(t)}\right)$ (manual calculation, log base 10, no offset)
    - $N$ = total number of documents
    - $DF(t)$ = number of documents containing term $t$

- **TF-IDF:**
  - $TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$
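
The three formulas translate almost line for line into code. The sketch below is a minimal illustration using only the standard library. The function names (`tf`, `idf`, `tfidf`) and the two-document toy corpus are made up for this example, and documents are assumed to be pre-tokenized lists of lowercase words.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log10(N / DF), with no offset or smoothing."""
    df = sum(1 for doc in corpus if term in doc)
    # Assumes the term occurs in at least one document (otherwise DF = 0).
    return math.log10(len(corpus) / df)

def tfidf(term, doc, corpus):
    """TF-IDF: the product of the two quantities above."""
    return tf(term, doc) * idf(term, corpus)

# Hypothetical toy corpus, already tokenized (illustration only)
toy_corpus = [["apple", "banana", "apple"], ["banana", "cherry"]]

apple_score = tfidf("apple", toy_corpus[0], toy_corpus)    # (2/3) * log10(2/1)
banana_score = tfidf("banana", toy_corpus[0], toy_corpus)  # (1/3) * log10(2/2)
print(round(apple_score, 3), round(banana_score, 3))
```

In this toy corpus "banana" appears in both documents, so its IDF, and therefore its TF-IDF score, is 0 no matter how often it occurs.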

---

## ⚠️ Note on scikit-learn's TfidfVectorizer
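- scikit-learn's `TfidfVectorizer` differs from the manual calculation in several ways, so the two sets of values will not match:
  - **TF** is the raw term count, not the count divided by the document length.
  - **IDF** uses the natural logarithm (log base e) and adds 1. With `smooth_idf=False` (as in the code below): $IDF_{sklearn}(t) = \log_e\left(\frac{N}{DF(t)}\right) + 1$. With the default `smooth_idf=True`: $IDF_{sklearn}(t) = \log_e\left(\frac{1 + N}{1 + DF(t)}\right) + 1$.
  - By default each document vector is also L2-normalized (`norm='l2'`); the code below turns this off with `norm=None`.
- Because of the +1, a term that appears in every document (like "the") still gets a nonzero value in the code output.
- The manual example below uses log base 10 and no offset, so a term that appears in every document gets exactly 0.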
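
To make the difference concrete, the sketch below computes three IDF variants side by side for a corpus of N = 3 documents. It uses only `math`; the function names are made up for this illustration, and the two scikit-learn formulas are written out from the library's documented definitions rather than obtained by calling the library.

```python
import math

def idf_manual(n_docs, df):
    """log10(N / DF): the form used in the manual calculation."""
    return math.log10(n_docs / df)

def idf_sklearn_unsmoothed(n_docs, df):
    """ln(N / DF) + 1: scikit-learn with smooth_idf=False."""
    return math.log(n_docs / df) + 1

def idf_sklearn_smoothed(n_docs, df):
    """ln((1 + N) / (1 + DF)) + 1: scikit-learn with smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + df)) + 1

N = 3
print("DF  manual  unsmoothed  smoothed")
for df in (3, 2, 1):
    print(df,
          round(idf_manual(N, df), 3),
          round(idf_sklearn_unsmoothed(N, df), 3),
          round(idf_sklearn_smoothed(N, df), 3))
```

A term found in all three documents (DF = 3) scores 0 under the manual formula but 1 under both scikit-learn variants, which is why it never vanishes from the library's output.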

---

## 2️⃣ Example Documents
Suppose we have 3 documents:
- d1: "the cat sat"
- d2: "the dog sat"
- d3: "the cat ate"

### Step 1: Build the Vocabulary
All unique terms: **the, cat, sat, dog, ate**

### Step 2: Term Frequency Table
| Term | d1 | d2 | d3 |
|------|----|----|----|
| the  |  1 |  1 |  1 |
| cat  |  1 |  0 |  1 |
| sat  |  1 |  1 |  0 |
| dog  |  0 |  1 |  0 |
| ate  |  0 |  0 |  1 |

Each document has 3 words, so $TF(t, d) = \frac{\text{count}}{3}$

| Term | d1 (TF) | d2 (TF) | d3 (TF) |
|------|---------|---------|---------|
| the  | 1/3     | 1/3     | 1/3     |
| cat  | 1/3     | 0       | 1/3     |
| sat  | 1/3     | 1/3     | 0       |
| dog  | 0       | 1/3     | 0       |
| ate  | 0       | 0       | 1/3     |
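
Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization is enough for this tiny corpus; the variable names (`tokens`, `vocab`, `tf_table`) are chosen for the example.

```python
from collections import Counter

docs = {"d1": "the cat sat", "d2": "the dog sat", "d3": "the cat ate"}

# Tokenize on whitespace (an assumption that suits this small example)
tokens = {name: text.split() for name, text in docs.items()}

# Vocabulary in order of first appearance: the, cat, sat, dog, ate
vocab = list(dict.fromkeys(word for words in tokens.values() for word in words))

# TF(t, d) = count of t in d / total terms in d
tf_table = {}
for name, words in tokens.items():
    counts = Counter(words)  # missing terms count as 0
    tf_table[name] = {term: counts[term] / len(words) for term in vocab}

for term in vocab:
    print(term, [round(tf_table[name][term], 3) for name in docs])
```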

### Step 3: Document Frequency (DF) and IDF
- the: appears in d1, d2, d3 → DF = 3
- cat: d1, d3 → DF = 2
- sat: d1, d2 → DF = 2
- dog: d2 → DF = 1
- ate: d3 → DF = 1

$N = 3$ (number of documents)

| Term | DF | IDF = $\log_{10}(N/DF)$ |
|------|----|------------------------|
| the  | 3  | $\log_{10}(3/3)=0$     |
| cat  | 2  | $\log_{10}(3/2)=0.176$ |
| sat  | 2  | $\log_{10}(3/2)=0.176$ |
| dog  | 1  | $\log_{10}(3/1)=0.477$ |
| ate  | 1  | $\log_{10}(3/1)=0.477$ |
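
The DF and IDF columns can be reproduced with a short sketch. It assumes whitespace tokenization and uses a set per document so that each term is counted at most once per document; the variable names are illustrative.

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ate"]
token_sets = [set(doc.split()) for doc in docs]  # one set of distinct terms per document
n_docs = len(docs)

idf = {}
for term in ["the", "cat", "sat", "dog", "ate"]:
    df = sum(term in tokens for tokens in token_sets)  # documents containing the term
    idf[term] = math.log10(n_docs / df)
    print(f"{term}: DF={df}, IDF={idf[term]:.3f}")
```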

---

### Step 4: TF-IDF Calculation
$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$

| Term | d1 | d2 | d3 |
|------|----------------|----------------|----------------|
| the  | 0              | 0              | 0              |
| cat  | 0.333×0.176=0.059 | 0              | 0.333×0.176=0.059 |
| sat  | 0.333×0.176=0.059 | 0.333×0.176=0.059 | 0              |
| dog  | 0              | 0.333×0.477=0.159 | 0              |
| ate  | 0              | 0              | 0.333×0.477=0.159 |

---

### Final TF-IDF Table
| Term | d1    | d2    | d3    |
|------|-------|-------|-------|
| the  | 0     | 0     | 0     |
| cat  | 0.059 | 0     | 0.059 |
| sat  | 0.059 | 0.059 | 0     |
| dog  | 0     | 0.159 | 0     |
| ate  | 0     | 0     | 0.159 |

**Interpretation:**
- Words like "the" (in every doc) get 0 weight in the manual calculation, but a nonzero value in scikit-learn (1.0 with the settings used in the code below) because of the +1 in its IDF.
- Words unique to a document ("dog", "ate") get the highest weight in that document.
- Words in some but not all docs ("cat", "sat") get medium weight.
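
The interpretation above can be checked by computing the manual TF-IDF scores and ranking the terms within each document. This is a minimal standard-library sketch; the helper name `tfidf` and the rounding to three decimals are choices made for this example.

```python
import math

docs = {"d1": "the cat sat", "d2": "the dog sat", "d3": "the cat ate"}
tokens = {name: text.split() for name, text in docs.items()}
n_docs = len(tokens)

def tfidf(term, words):
    """Manual TF-IDF: (count / document length) * log10(N / DF)."""
    tf = words.count(term) / len(words)
    df = sum(term in other for other in tokens.values())
    return tf * math.log10(n_docs / df)

# Score only the terms that actually occur in each document
scores = {
    name: {term: round(tfidf(term, words), 3) for term in words}
    for name, words in tokens.items()
}

for name, term_scores in scores.items():
    ranked = sorted(term_scores.items(), key=lambda item: item[1], reverse=True)
    print(name, ranked)
```

In d2 and d3 the term unique to the document ("dog", "ate") ranks first, while d1 has no unique term, so "cat" and "sat" tie.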

---

**Summary:**
- TF-IDF highlights words that are important to a document but not common across all documents.

```python
# Example: TF-IDF Calculation in Python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example documents
docs = [
    "the cat sat",  # d1
    "the dog sat",  # d2
    "the cat ate"   # d3
]

# Create the TF-IDF vectorizer.
# norm=None turns off the default L2 normalization, and smooth_idf=False
# gives idf = ln(N / DF) + 1, so each value is raw count * idf.
vectorizer = TfidfVectorizer(norm=None, use_idf=True, smooth_idf=False)
X = vectorizer.fit_transform(docs)

# Get feature names (sorted alphabetically) and the dense TF-IDF matrix
features = vectorizer.get_feature_names_out()
tfidf_matrix = X.toarray()

# Show as a DataFrame for clarity
df = pd.DataFrame(tfidf_matrix, columns=features, index=["d1", "d2", "d3"])
print("TF-IDF Matrix:")
print(df.round(3))
```

```
TF-IDF Matrix:
      ate    cat    dog    sat  the
d1  0.000  1.405  0.000  1.405  1.0
d2  0.000  0.000  2.099  1.405  1.0
d3  2.099  1.405  0.000  0.000  1.0
```
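
The printed values can be reproduced by hand, which helps confirm where they differ from the manual table. The sketch below uses only the standard library and assumes the settings shown in the code (`norm=None`, `smooth_idf=False`), under which each entry is the raw count multiplied by ln(N / DF) + 1.

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ate"]
tokens = [doc.split() for doc in docs]
vocab = sorted({word for words in tokens for word in words})  # alphabetical, like the columns above
n_docs = len(docs)

rows = []
for words in tokens:
    row = []
    for term in vocab:
        count = words.count(term)                    # raw count, not divided by length
        df = sum(term in other for other in tokens)  # documents containing the term
        row.append(round(count * (math.log(n_docs / df) + 1), 3))
    rows.append(row)

print(vocab)
for name, row in zip(["d1", "d2", "d3"], rows):
    print(name, row)
```

For example, "cat" in d1 is 1 × (ln(3/2) + 1) ≈ 1.405, and "the" is 1 × (ln(3/3) + 1) = 1.0.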


# Count Vectorization (Bag-of-Words Model)

**Definition:**
Count Vectorization, also known as the Bag-of-Words (BoW) model, is a simple and widely used technique to convert text documents into numerical feature vectors. It represents each document by counting the number of times each word appears, ignoring grammar and word order but keeping multiplicity.

- **Key idea:**
  - Each unique word in the corpus becomes a feature (column).
  - Each document is represented as a vector of word counts.
  - The result is a sparse matrix where each row is a document and each column is a word from the vocabulary.

---

## 1️⃣ Formula
- For a document $d$ and term $t$:
  - $\text{Count}(t, d) = \text{Number of times term } t \text{ appears in document } d$

---

## 2️⃣ Example Documents
Suppose we have the same 3 documents:
- d1: "the cat sat"
- d2: "the dog sat"
- d3: "the cat ate"

### Step 1: Build the Vocabulary
All unique terms: **the, cat, sat, dog, ate**

### Step 2: Count Vector Table
| Term | d1 | d2 | d3 |
|------|----|----|----|
| the  |  1 |  1 |  1 |
| cat  |  1 |  0 |  1 |
| sat  |  1 |  1 |  0 |
| dog  |  0 |  1 |  0 |
| ate  |  0 |  0 |  1 |

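- Each cell shows the number of times the word appears in the document.
- Each document is a column of this table, read as a vector: e.g., d1 = [1, 1, 1, 0, 0] (for [the, cat, sat, dog, ate]).
- scikit-learn's output below is the transpose of this table (documents as rows), with the vocabulary sorted alphabetically.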
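
The same vectors can be built without any library. The sketch below is a minimal illustration using `collections.Counter` and whitespace tokenization; the helper name `count_vector` and the extra sentence in the last lines are hypothetical and exist only to show that multiplicity is kept while word order is ignored.

```python
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ate"]
vocab = ["the", "cat", "sat", "dog", "ate"]  # same order as the table above

def count_vector(text, vocabulary):
    """Return the count of each vocabulary term in `text`, in vocabulary order."""
    counts = Counter(text.split())  # terms that are absent count as 0
    return [counts[term] for term in vocabulary]

vectors = [count_vector(doc, vocab) for doc in docs]
for name, vector in zip(["d1", "d2", "d3"], vectors):
    print(name, vector)

# Hypothetical sentences: repeated words raise the counts, reordering changes nothing
repeated = count_vector("the cat the cat sat", vocab)
shuffled = count_vector("sat cat the", vocab)
print(repeated, shuffled == vectors[0])
```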

---

**Summary:**
- Count Vectorization is simple and fast, but it weights every word by its raw count alone and ignores how common the word is across documents (unlike TF-IDF).
- It is often used as a baseline for text feature extraction.

```python
# Example: Count Vectorization in Python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example documents
docs = [
    "the cat sat",  # d1
    "the dog sat",  # d2
    "the cat ate"   # d3
]

# Create the CountVectorizer (lowercases text and tokenizes words by default)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Get feature names (sorted alphabetically) and the dense count matrix
features = vectorizer.get_feature_names_out()
count_matrix = X.toarray()

# Show as a DataFrame for clarity
df = pd.DataFrame(count_matrix, columns=features, index=["d1", "d2", "d3"])
print("Count Vector Matrix:")
print(df)
```

```
Count Vector Matrix:
    ate  cat  dog  sat  the
d1    0    1    0    1    1
d2    0    0    1    1    1
d3    1    1    0    0    1
```
