Here’s a **complete, easy-to-understand note on TF-IDF (Term Frequency–Inverse Document Frequency)**, covering the concept, formula, step-by-step example, and advantages/disadvantages 👇

---

# 🧠 TF-IDF (Term Frequency – Inverse Document Frequency)

## 📘 Introduction

**TF-IDF** is a statistical measure used in **Natural Language Processing (NLP)** and **Information Retrieval (IR)** to evaluate how important a word is to a document in a collection or corpus.
It combines **Term Frequency (TF)** and **Inverse Document Frequency (IDF)** to give a weight to each word.

---

## 🧩 1. Term Frequency (TF)

### 🔹 Definition

TF measures how frequently a term (word) occurs in a document.

### 🧮 Formula:

$$
TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

### 📖 Example

For a sample document:

> "NLP is fun and NLP is powerful"

The document has 7 words in total, so:

| Term     | Count | TF (Count / Total words) |
| -------- | ----- | ------------------------ |
| NLP      | 2     | 2/7 = 0.29               |
| is       | 2     | 2/7 = 0.29               |
| fun      | 1     | 1/7 = 0.14               |
| and      | 1     | 1/7 = 0.14               |
| powerful | 1     | 1/7 = 0.14               |
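
The TF formula can be checked with a few lines of Python. This is a minimal sketch, assuming lowercasing and whitespace tokenization (real tokenizers handle punctuation and more); `term_frequency` is a hypothetical helper name, not a library function. It counts all seven tokens of the sample sentence, including "and".

```python
from collections import Counter


def term_frequency(text: str) -> dict[str, float]:
    """Return TF(t, d) = (count of t in d) / (total tokens in d)."""
    # Assumption: lowercase and split on whitespace only.
    tokens = text.lower().split()
    if not tokens:
        return {}  # an empty document has no terms to score
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("NLP is fun and NLP is powerful")
for term, value in tf.items():
    print(f"{term}: {value:.2f}")
```

Because every count is divided by the same total, the TF values of one document always sum to 1.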

---

## 🧩 2. Inverse Document Frequency (IDF)

### 🔹 Definition

IDF measures how **important** a term is, based on how rare or common it is across all documents.
A term that appears in **many documents** gets a **low weight**, while a rare term gets a **high weight**.

### 🧮 Formula:

$$
IDF(t) = \log\left(\frac{N}{DF(t)}\right)
$$

Where:

* **N** = total number of documents in the corpus
* **DF(t)** = number of documents that contain the term *t*

The base of the logarithm only rescales the weights; the worked example in these notes uses base 10.

> (Sometimes, to avoid division by zero for a term that appears in no document, we use:
> $IDF(t) = \log\left(\frac{N}{1 + DF(t)}\right)$)
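
Both IDF variants can be sketched in Python as follows. The corpus, the base-10 logarithm, the whitespace tokenization, and the function name `inverse_document_frequency` are all assumptions made for illustration.

```python
import math


def inverse_document_frequency(term: str, documents: list[str], smooth: bool = False) -> float:
    """IDF(t) = log10(N / DF(t)); with smooth=True, log10(N / (1 + DF(t)))."""
    n_docs = len(documents)
    # DF(t): number of documents whose tokens include the term.
    df = sum(1 for doc in documents if term.lower() in doc.lower().split())
    if smooth:
        return math.log10(n_docs / (1 + df))
    if df == 0:
        raise ValueError(f"{term!r} appears in no document, so N / DF(t) is undefined")
    return math.log10(n_docs / df)


docs = ["the cat sat", "the dog barked", "the bird sang"]  # hypothetical corpus
print(inverse_document_frequency("the", docs))                # in every document
print(inverse_document_frequency("cat", docs))                # in one document
print(inverse_document_frequency("fish", docs, smooth=True))  # in no document
```

One thing the sketch makes visible: with the `1 + DF(t)` variant, a term that occurs in every document gets a negative IDF, since N / (1 + N) is below 1.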

---

## 🧩 3. TF-IDF Calculation

### 🔹 Formula:

$$
TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)
$$

### 📊 Interpretation:

* **High TF-IDF** → term is frequent in this document but rare in others → important!
* **Low TF-IDF** → term is common across all documents → less useful (e.g., “the”, “is”).

---

## 📚 Example

Let’s say we have 3 short documents:

| Document | Text                               |
| -------- | ---------------------------------- |
| D₁       | "NLP is fun"                       |
| D₂       | "NLP is cool"                      |
| D₃       | "NLP and machine learning are fun" |

### Step 1️⃣: Vocabulary

All unique words:
`["NLP", "is", "fun", "cool", "and", "machine", "learning", "are"]`

### Step 2️⃣: Compute TF for each document

| Term     | D₁  | D₂  | D₃  |
| -------- | --- | --- | --- |
| NLP      | 1/3 | 1/3 | 1/6 |
| is       | 1/3 | 1/3 | 0   |
| fun      | 1/3 | 0   | 1/6 |
| cool     | 0   | 1/3 | 0   |
| and      | 0   | 0   | 1/6 |
| machine  | 0   | 0   | 1/6 |
| learning | 0   | 0   | 1/6 |
| are      | 0   | 0   | 1/6 |

### Step 3️⃣: Compute IDF

| Term     | DF | IDF = log₁₀(N/DF) (N = 3) |
| -------- | -- | ------------------------- |
| NLP      | 3  | log₁₀(3/3) = 0            |
| is       | 2  | log₁₀(3/2) = 0.176        |
| fun      | 2  | log₁₀(3/2) = 0.176        |
| cool     | 1  | log₁₀(3/1) = 0.477        |
| and      | 1  | log₁₀(3/1) = 0.477        |
| machine  | 1  | log₁₀(3/1) = 0.477        |
| learning | 1  | log₁₀(3/1) = 0.477        |
| are      | 1  | log₁₀(3/1) = 0.477        |

### Step 4️⃣: Compute TF-IDF (TF × IDF)

| Term     | D₁    | D₂    | D₃    |
| -------- | ----- | ----- | ----- |
| NLP      | 0     | 0     | 0     |
| is       | 0.059 | 0.059 | 0     |
| fun      | 0.059 | 0     | 0.029 |
| cool     | 0     | 0.159 | 0     |
| and      | 0     | 0     | 0.080 |
| machine  | 0     | 0     | 0.080 |
| learning | 0     | 0     | 0.080 |
| are      | 0     | 0     | 0.080 |
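
Steps 1 through 4 can be reproduced end to end with the standard library. This is a sketch under stated assumptions: lowercased whitespace tokens, base-10 logarithms as in the IDF table, and plain `D1`/`D2`/`D3` keys standing in for D₁, D₂, D₃.

```python
import math
from collections import Counter

corpus = {
    "D1": "NLP is fun",
    "D2": "NLP is cool",
    "D3": "NLP and machine learning are fun",
}

# Assumption: lowercase + whitespace tokenization.
tokenized = {name: text.lower().split() for name, text in corpus.items()}
n_docs = len(tokenized)

# DF(t): number of documents that contain each term (set() counts a doc once).
doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

# IDF(t) = log10(N / DF(t))
idf = {term: math.log10(n_docs / df) for term, df in doc_freq.items()}

# TF-IDF(t, d) = (count of t in d / number of tokens in d) * IDF(t)
tfidf = {
    name: {
        term: (count / len(tokens)) * idf[term]
        for term, count in Counter(tokens).items()
    }
    for name, tokens in tokenized.items()
}

for name, scores in tfidf.items():
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    print(name, [(term, round(score, 3)) for term, score in ranked])
```

Sorting each document's terms by score is also the simplest form of keyword extraction: the top entries are the terms that best distinguish that document from the rest of the corpus.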

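### 🧠 Interpretation:

* Words like **“cool”**, **“machine”**, and **“learning”** appear in only one document each → highest IDF → **higher TF-IDF**. “cool” scores highest overall (0.159) because D₂ is shorter than D₃, so its TF is larger.
* **“NLP”** appears in every document → IDF = log(3/3) = 0 → a TF-IDF of 0 everywhere, so it cannot distinguish the documents. **“is”** and **“fun”** appear in two of the three documents → low scores.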

---

## 💡 Applications of TF-IDF

1. **Information Retrieval/Search Engines** – rank documents by relevance.
2. **Text Classification** – convert text into numeric features.
3. **Keyword Extraction** – identify important terms in a document.
4. **Document Similarity** – compare texts using cosine similarity of TF-IDF vectors.
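
For the document-similarity use case, cosine similarity over sparse TF-IDF vectors might look like the sketch below. The three weight dictionaries are made-up values for illustration only, and returning 0.0 for an all-zero vector is a convention chosen here, not a standard.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical TF-IDF weights, invented for this example.
doc_a = {"machine": 0.30, "learning": 0.30, "fun": 0.10}
doc_b = {"machine": 0.25, "learning": 0.25, "cool": 0.40}
doc_c = {"cooking": 0.50, "recipes": 0.50}

print(round(cosine_similarity(doc_a, doc_b), 3))  # shared terms, so above 0
print(round(cosine_similarity(doc_a, doc_c), 3))  # no shared terms
```

Documents with no terms in common score exactly 0, which is the practical face of the "does not capture semantics" limitation: two texts about the same topic in different words look unrelated.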

---

## ⚖️ Advantages & Disadvantages

| ✅ Advantages                             | ❌ Disadvantages                                    |
| ---------------------------------------- | -------------------------------------------------- |
| Simple and effective text representation | Ignores word order (bag-of-words approach)         |
| Highlights important/rare terms          | Does not capture semantics (e.g., synonyms)        |
| Works well for smaller datasets          | High dimensionality for large vocabularies         |
| Used in many ML/NLP tasks                | Static weights; not context-aware like embeddings  |

---

## 🧰 Python Example (Using Scikit-learn)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
corpus = [
    "NLP is fun",
    "NLP is cool",
    "NLP and machine learning are fun"
]

# Initialize vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform
X = vectorizer.fit_transform(corpus)

# Display feature names and the TF-IDF matrix (rounded for readability)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(4))
```

### 🖥️ Output

```
['and' 'are' 'cool' 'fun' 'is' 'learning' 'machine' 'nlp']
[[0.     0.     0.     0.6198 0.6198 0.     0.     0.4813]
 [0.     0.     0.7203 0.     0.5478 0.     0.     0.4254]
 [0.4505 0.4505 0.     0.3426 0.     0.4505 0.4505 0.2661]]
```
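
scikit-learn's numbers will not match a hand calculation with the textbook formula. With default settings, `TfidfVectorizer` lowercases the text, uses raw counts as TF, applies a smoothed natural-log IDF, ln((1 + N) / (1 + DF)) + 1, and then L2-normalizes each row. The pure-Python sketch below mimics those defaults; it assumes whitespace tokenization, which only agrees with scikit-learn's tokenizer here because the corpus has no punctuation or one-character words. The helper names are hypothetical.

```python
import math
from collections import Counter

corpus = ["NLP is fun", "NLP is cool", "NLP and machine learning are fun"]

# Assumption: lowercase + whitespace split stands in for the real tokenizer.
tokenized = [doc.lower().split() for doc in corpus]
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term: str) -> float:
    # Smoothed natural-log IDF: ln((1 + N) / (1 + DF)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def l2_normalized_tfidf(tokens: list[str]) -> dict[str, float]:
    # Raw counts (not count / length) times IDF, then scale the row to unit length.
    raw = {term: count * smoothed_idf(term) for term, count in Counter(tokens).items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


for tokens in tokenized:
    row = l2_normalized_tfidf(tokens)
    print({term: round(weight, 4) for term, weight in sorted(row.items())})
```

Because of the `+ 1` in the smoothed IDF, a term such as "nlp" that appears in every document keeps a non-zero weight instead of dropping to 0.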

---

## 🧾 Summary

| Concept    | Meaning                                             |
| ---------- | --------------------------------------------------- |
| **TF**     | Frequency of term in document                       |
| **IDF**    | Rareness of term across documents                   |
| **TF-IDF** | Importance of term (TF × IDF)                       |
| **Goal**   | Highlight unique, meaningful words in each document |

