```{contents}
```

# TF-IDF

TF-IDF is a **numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus).**

It combines two measures:

1. **TF (Term Frequency):** How often a word appears in a document.
2. **IDF (Inverse Document Frequency):** How unique or rare a word is across all documents in the corpus.

Mathematically:

$$
\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)
$$

Where:

$$
\text{IDF}(t) = \log \frac{N}{1 + \text{DF}(t)}
$$

* $t$ → term (word)
* $d$ → document
* $N$ → total number of documents
* $\text{DF}(t)$ → number of documents containing the term $t$
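* Adding 1 in the denominator avoids division by zero when a term appears in no document. As a side effect, a term found in $N - 1$ documents gets an IDF of exactly 0, and a term found in all $N$ documents gets a slightly negative IDF.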
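As a rough sketch, the two formulas above can be written directly in Python. The function names, the pre-split token lists, the natural logarithm, and TF taken as count divided by document length are all assumptions made for illustration; the definitions above do not fix any of them.

```python
import math

def term_frequency(term, doc_tokens):
    """TF(t, d): occurrences of the term divided by the document length (assumed normalisation)."""
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus_tokens):
    """IDF(t) = log(N / (1 + DF(t))), using the natural logarithm (assumed base)."""
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(n_docs / (1 + doc_freq))

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)
```

Each document is passed as a list of tokens, so the tokenisation step (lowercasing, punctuation handling) stays outside these helpers and can be swapped freely.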

---

### **2. Intuition**

* **High TF:** Word appears frequently in a document → important for that document.
* **High IDF:** Word appears in fewer documents → more unique → carries more information.
* **High TF-IDF:** Word is frequent in a document **and** rare across other documents → highly significant.

**Example:**

* Word "the" → high TF but appears in almost all documents → low IDF → low TF-IDF.
* Word "NLP" → appears multiple times in a specific document but rarely elsewhere → high TF-IDF.

---

### **3. Example**

Suppose we have **3 documents**:

```
Doc1: "I love NLP"
Doc2: "NLP is amazing"
Doc3: "I love Python"
```

**Step 1: Calculate TF**

* Doc1: I(1), love(1), NLP(1) → total 3 words
* TF("NLP", Doc1) = 1 / 3 ≈ 0.333

**Step 2: Calculate IDF**

* NLP appears in 2 documents → DF("NLP") = 2
* Total documents N = 3
* IDF("NLP") = log(3 / (1+2)) = log(3/3) = log(1) = 0

**Step 3: TF-IDF**

* TF-IDF("NLP", Doc1) = TF × IDF = 0.333 × 0 = 0

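> Note: "NLP" appears in 2 of the 3 documents, so the +1 in the denominator makes its IDF exactly 0. More generally, words that appear in almost all documents get a TF-IDF close to 0.

* Word "Python" appears only in Doc3 → IDF("Python") = log(3 / (1+1)) = log(1.5) ≈ 0.405 (natural log), so TF-IDF("Python", Doc3) = 0.333 × 0.405 ≈ 0.135, the highest score in Doc3.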
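The three steps can be checked for every word at once with a few lines of Python. This is only a sketch: it assumes whitespace tokenisation that keeps the original casing, and the natural logarithm for the IDF.

```python
import math

docs = {
    "Doc1": "I love NLP",
    "Doc2": "NLP is amazing",
    "Doc3": "I love Python",
}
# Whitespace tokenisation (an assumption; case is left untouched).
tokens = {name: text.split() for name, text in docs.items()}
n_docs = len(tokens)

def idf(term):
    # Smoothed form used in this section: log(N / (1 + DF)).
    df = sum(term in words for words in tokens.values())
    return math.log(n_docs / (1 + df))

# TF-IDF for every word of every document.
scores = {
    name: {w: words.count(w) / len(words) * idf(w) for w in words}
    for name, words in tokens.items()
}

for name, row in scores.items():
    print(name, {w: round(s, 3) for w, s in row.items()})
# Doc1 {'I': 0.0, 'love': 0.0, 'NLP': 0.0}
# Doc2 {'NLP': 0.0, 'is': 0.135, 'amazing': 0.135}
# Doc3 {'I': 0.0, 'love': 0.0, 'Python': 0.135}
```

Every word that occurs in two of the three documents scores 0 under this formula, while words that occur in a single document score about 0.135.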

---



### **4. Implementation with scikit-learn**

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
docs = [
    "I love NLP",
    "NLP is amazing",
    "I love Python"
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Feature names (words)
print("Vocabulary:", vectorizer.get_feature_names_out())

# TF-IDF matrix
print("TF-IDF matrix:\n", X.toarray())
```

```
Vocabulary: ['amazing' 'is' 'love' 'nlp' 'python']
TF-IDF matrix:
 [[0.         0.         0.70710678 0.70710678 0.        ]
 [0.62276601 0.62276601 0.         0.4736296  0.        ]
 [0.         0.         0.60534851 0.         0.79596054]]
```
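The numbers in this matrix differ from the hand calculation because `TfidfVectorizer` applies different defaults: it lowercases the text, drops single-character tokens such as "I", uses the smoothed IDF $\ln\frac{1 + N}{1 + \text{DF}(t)} + 1$ with raw counts as TF, and scales each row to unit L2 length. The pure-Python sketch below mirrors those defaults to show where the values come from; it is an illustration, not scikit-learn's actual implementation, and the helper name `smoothed_idf` is made up here.

```python
import math
import re

docs = ["I love NLP", "NLP is amazing", "I love Python"]

# Lowercase and keep only tokens of two or more word characters,
# mirroring TfidfVectorizer's default token pattern (so "I" is dropped).
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({w for words in tokenized for w in words})
n_docs = len(docs)

def smoothed_idf(term):
    # ln((1 + N) / (1 + DF)) + 1, the smoothed variant described above.
    df = sum(term in words for words in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for words in tokenized:
    # Raw count times IDF, then L2-normalise the row.
    raw = [words.count(t) * smoothed_idf(t) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    rows.append([v / norm for v in raw])

print(vocab)
for row in rows:
    print([round(v, 4) for v in row])
```

The rounded rows agree with the matrix printed by scikit-learn, which also explains why "nlp" no longer scores 0: the added 1 keeps every IDF strictly positive.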



---

### **5. Key Points**

* TF-IDF reduces the weight of common words like "the", "is", "and".
* Highlights **words unique to a document**.
* Widely used in **text classification, search engines, and recommendation systems**.

---

### **TF-IDF Intuition**

TF-IDF stands for **Term Frequency – Inverse Document Frequency**. It’s a way to **weight words based on importance** in a corpus. The key idea is:

1. Words that appear **often in a document** are important → TF (Term Frequency).
2. Words that appear **in many documents** are less informative → IDF (Inverse Document Frequency).

---

### **Step 1: Term Frequency (TF)**

* Measures how often a word occurs in a document.
* Intuition: If a word occurs more often in a document, it’s probably important for that document.

$$
TF(word) = \frac{\text{Number of times word appears in doc}}{\text{Total number of words in doc}}
$$

**Example:**

Document: `"I love NLP and NLP is amazing"`

* Word `"NLP"` occurs 2 times out of 7 words → TF(NLP) = 2/7 ≈ 0.286
* Word `"amazing"` occurs 1 time out of 7 → TF(amazing) = 1/7 ≈ 0.143
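The count can be verified with `collections.Counter`. This sketch assumes plain whitespace splitting, which yields 7 tokens for the example sentence.

```python
from collections import Counter

document = "I love NLP and NLP is amazing"
tokens = document.split()   # whitespace tokenisation (an assumption)
counts = Counter(tokens)

# TF = occurrences of the word / total number of words in the document.
tf = {word: count / len(tokens) for word, count in counts.items()}

print(len(tokens))                                    # 7
print(round(tf["NLP"], 3), round(tf["amazing"], 3))   # 0.286 0.143
```

Because every count is divided by the same document length, the TF values of a document always sum to 1.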

---

### **Step 2: Inverse Document Frequency (IDF)**

* Measures how rare a word is across all documents.
* Intuition: Words like `"the"`, `"is"`, `"and"` appear everywhere → not important. Words like `"NLP"`, `"Python"` appear less → more important.

$$
IDF(word) = \log \frac{\text{Total number of documents}}{\text{Number of documents containing the word}}
$$

**Example:**

Corpus:

1. `"I love NLP"`
2. `"NLP is amazing"`
3. `"I love Python"`

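* `"NLP"` appears in 2 documents → IDF(NLP) = log(3/2) ≈ 0.405
* `"Python"` appears in 1 document → IDF(Python) = log(3/1) ≈ 1.099
* Both values use the natural logarithm. This formula has no +1 in the denominator, so the results differ from the smoothed variant $\log \frac{N}{1 + \text{DF}(t)}$ defined earlier.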
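The choice of logarithm base changes the numbers but not the ranking of terms. The sketch below evaluates the unsmoothed formula with both the natural logarithm and base 10; the `log` parameter is an illustrative device, not part of any standard API.

```python
import math

corpus = ["I love NLP", "NLP is amazing", "I love Python"]
tokenized = [doc.split() for doc in corpus]

def idf(term, log=math.log):
    # Unsmoothed IDF: log(N / DF). Raises ZeroDivisionError if the term
    # appears in no document, which is what smoothing is meant to avoid.
    df = sum(term in words for words in tokenized)
    return log(len(tokenized) / df)

for term in ("NLP", "Python"):
    print(term, round(idf(term), 3), round(idf(term, math.log10), 3))
# NLP 0.405 0.176
# Python 1.099 0.477
```

Whichever base is used, it should be the same for every term so that scores stay comparable.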

---

### **Step 3: TF-IDF**

Finally, multiply TF and IDF to get **TF-IDF weight**:

$$
TF\text{-}IDF(word, doc) = TF(word, doc) \times IDF(word)
$$

* **High TF-IDF → Important word in the document**
* **Low TF-IDF → Common or less relevant word**

**Intuition:**

* Words that are **frequent in a document** but **rare in the corpus** get the **highest scores**.
* Words that are **common across documents** get **low scores**, even if frequent in one document.
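To make the second point concrete, here is a small sketch on a hypothetical three-sentence corpus, invented purely for illustration, in which "the" is the most frequent word of the first document yet ends up with the lowest score. It uses the unsmoothed IDF with the natural logarithm.

```python
import math
from collections import Counter

# Hypothetical mini corpus, invented for illustration only.
corpus = [
    "the model parses the text",
    "the team won the match",
    "the market fell",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

def tf_idf_scores(words):
    """TF-IDF of each distinct word in one document (unsmoothed IDF)."""
    counts = Counter(words)
    scores = {}
    for term, count in counts.items():
        df = sum(term in other for other in tokenized)
        scores[term] = (count / len(words)) * math.log(n_docs / df)
    return scores

first = tf_idf_scores(tokenized[0])
ranked = sorted(first, key=first.get, reverse=True)
print(ranked)        # "the" comes last
print(first["the"])  # 0.0, because "the" occurs in every document
```

Since "the" appears in all three documents, its IDF is log(3/3) = 0, which cancels its high term frequency.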

---

**Analogy**

* Imagine reading news articles:

  * The word `"the"` appears in every article → it does not help tell articles apart.
  * The word `"NLP"` appears only in tech articles → it signals what the article is about.
* TF-IDF mathematically captures this intuition.
