# Step 2 TF-IDF
TF-IDF stands for Term Frequency - Inverse Document Frequency.
- It does more than convert text into numbers.
- It weights each word by importance, so very common words such as stop words receive low weights.
- It ignores word order and meaning; only the weights are kept.
- TF-IDF is a weighted bag-of-words (BoW) representation, not a semantic model.

## 🧠 Deep Dive into TF, DF, and IDF

Let’s take an example corpus:

```
Doc1: "This is a good book"
Doc2: "This book is about NLP"
Doc3: "I love this book"
Doc4: "This is amazing"
```

---

### 🔹 1️⃣ **Term Frequency (TF)**

> Measures how frequently a word occurs in a single document.

$$
TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

👉 Example:

* In **Doc1**, “this” appears **1 time**
* Total words = 5
  → **TF("this", Doc1) = 1/5 = 0.2**
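
The TF calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming lowercase matching and whitespace tokenization; the helper name `term_frequency` is hypothetical and not part of any library.

```python
from collections import Counter

# Example corpus from the text above
corpus = {
    "Doc1": "This is a good book",
    "Doc2": "This book is about NLP",
    "Doc3": "I love this book",
    "Doc4": "This is amazing",
}

def term_frequency(term, text):
    """Share of the document's tokens that equal `term` (case-insensitive)."""
    tokens = text.lower().split()          # naive whitespace tokenizer (assumption)
    return Counter(tokens)[term.lower()] / len(tokens)

print(term_frequency("this", corpus["Doc1"]))  # 1 occurrence / 5 tokens = 0.2
```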

---

### 🔹 2️⃣ **Document Frequency (DF)**

> Measures how many documents contain the word at least once.

👉 Example:

* “this” appears in **Doc1, Doc2, Doc3, Doc4** → **DF = 4**
* “NLP” appears in **Doc2 only** → **DF = 1**
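
Counting DF by hand gets tedious for larger corpora, so here is a small sketch of the same count in Python. It assumes the same lowercase, whitespace-split tokens as the example; `document_frequency` is a hypothetical helper name.

```python
# Example corpus from the text above
corpus = [
    "This is a good book",      # Doc1
    "This book is about NLP",   # Doc2
    "I love this book",         # Doc3
    "This is amazing",          # Doc4
]

def document_frequency(term, docs):
    """Number of documents that contain `term` at least once."""
    term = term.lower()
    # Repeated occurrences inside one document still count only once
    return sum(term in text.lower().split() for text in docs)

print(document_frequency("this", corpus))  # 4
print(document_frequency("NLP", corpus))   # 1
```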

---

### 🔹 3️⃣ **Inverse Document Frequency (IDF)**

> Gives higher weight to **rare** words and lower weight to **common** ones.

$$
IDF(t) = \log\left(\frac{N}{DF(t)}\right)
$$
where `N` is the total number of documents and `log` is the natural logarithm (the values in the example use base e).

👉 Example:

* N = 4
* For “this”: DF = 4 → IDF = log(4/4) = log(1) = **0.0**
* For “NLP”: DF = 1 → IDF = log(4/1) = **1.386**

So “this” is **not important** (because it’s everywhere),
but “NLP” is **important** (because it’s rare).
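
The two IDF values can be checked with a short Python sketch. It assumes the natural logarithm and whitespace tokenization, and `inverse_document_frequency` is a hypothetical helper name. Note that this plain formula is undefined for a word that appears in no document.

```python
import math

# Example corpus from the text above
corpus = [
    "This is a good book",      # Doc1
    "This book is about NLP",   # Doc2
    "I love this book",         # Doc3
    "This is amazing",          # Doc4
]

def inverse_document_frequency(term, docs):
    """log(N / DF) with the natural log; the term must occur in at least one document."""
    term = term.lower()
    df = sum(term in text.lower().split() for text in docs)
    if df == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(docs) / df)

print(round(inverse_document_frequency("this", corpus), 3))  # 0.0
print(round(inverse_document_frequency("NLP", corpus), 3))   # 1.386
```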

---

### 🔹 4️⃣ **TF-IDF Weight**

$$
TFIDF(t, d) = TF(t, d) \times IDF(t)
$$

→ A word’s final weight depends both on:

* **Its frequency** in a document (TF)
* **Its rarity** across documents (IDF)
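
Putting the pieces together, the sketch below computes TF × IDF for every word in the example corpus. It is an illustration of the plain formula only, assuming whitespace tokenization and the natural log; the names `tokenized`, `doc_freq`, and `tfidf` are chosen for this example.

```python
import math
from collections import Counter

# Example corpus from the text above
docs = {
    "Doc1": "This is a good book",
    "Doc2": "This book is about NLP",
    "Doc3": "I love this book",
    "Doc4": "This is amazing",
}

tokenized = {name: text.lower().split() for name, text in docs.items()}
n_docs = len(tokenized)

# DF: count each term once per document it appears in
doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

def tfidf(term, doc_name):
    """TF(t, d) * IDF(t); only defined for terms that occur in the corpus."""
    tokens = tokenized[doc_name]
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

for name, tokens in tokenized.items():
    weights = {t: round(tfidf(t, name), 3) for t in dict.fromkeys(tokens)}
    print(name, weights)
```

In this run, "this" gets weight 0 in every document, while "nlp" in Doc2 gets 0.2 × 1.386 ≈ 0.277.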

---

### 💡 Intuitive Summary

| Word | Document | TF        | DF         | IDF          | TF-IDF  | Meaning         |
| ---- | -------- | --------- | ---------- | ------------ | ------- | --------------- |
| this | Doc1     | 1/5 = 0.2 | 4 (common) | 0.0 (low)    | 0.0     | Low importance  |
| NLP  | Doc2     | 1/5 = 0.2 | 1 (rare)   | 1.386 (high) | ≈ 0.277 | High importance |

So a word gets a **high TF-IDF** when it is frequent within a document and rare across documents → *most meaningful*.


```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love badminton and dancing",
    "Gardening is relaxing and peaceful",
    "I play badminton every weekend",
    "Dancing keeps me active and happy"
]

# Learn the vocabulary and IDF weights, then build the TF-IDF matrix
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# Each matrix column corresponds to one feature name, in the same order
print("Feature Names:", tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", X_tfidf.toarray())
```


Output:

```
Feature Names: ['active' 'and' 'badminton' 'dancing' 'every' 'gardening' 'happy' 'is'
 'keeps' 'love' 'me' 'peaceful' 'play' 'relaxing' 'weekend']
TF-IDF Matrix:
 [[0.         0.39205255 0.4842629  0.4842629  0.         0.
  0.         0.         0.         0.61422608 0.         0.
  0.         0.         0.        ]
 [0.         0.30403549 0.         0.         0.         0.47633035
  0.         0.47633035 0.         0.         0.         0.47633035
  0.         0.47633035 0.        ]
 [0.         0.         0.41428875 0.         0.52547275 0.
  0.         0.         0.         0.         0.         0.
  0.52547275 0.         0.52547275]
 [0.44592216 0.28462634 0.         0.35157015 0.         0.
  0.44592216 0.         0.44592216 0.         0.44592216 0.
  0.         0.         0.        ]]
```
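
These numbers do not match the plain TF × IDF formula: "and" occurs in 3 of 4 documents yet still gets a non-zero weight, and "I" is missing from the feature names. The sketch below reproduces the first row by hand, assuming the default settings of `TfidfVectorizer`: tokens of at least two word characters, raw counts as TF, smoothed IDF `ln((1 + N) / (1 + DF)) + 1`, and L2 normalization of each row. The variable names are chosen for this example, and the code uses only the standard library.

```python
import math
import re
from collections import Counter

corpus = [
    "I love badminton and dancing",
    "Gardening is relaxing and peaceful",
    "I play badminton every weekend",
    "Dancing keeps me active and happy",
]

# Assumed default tokenization: lowercase, keep tokens of 2+ word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(text.lower()) for text in corpus]

vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(corpus)
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

# Smoothed IDF: never zero, even for a word present in every document
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count * smoothed IDF, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

matrix = [tfidf_row(tokens) for tokens in tokenized]
row0 = dict(zip(vocab, matrix[0]))
print(round(row0["and"], 3), round(row0["badminton"], 3), round(row0["love"], 3))
# 0.392 0.484 0.614
```

Because of the smoothing and normalization, scikit-learn's weights are comparable across documents of different lengths, but they should not be expected to equal hand-computed `TF × log(N / DF)` values.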


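# Summary

1. Term frequency (TF)
How often a word occurs in a document, divided by the total number of words in that document.

Example corpus of 4 documents:
- Doc1: this is book
- Doc2: this is a chair
- Doc3: this is a car
- Doc4: this is a toy

"this" occurs 1 time in Doc1, and Doc1 has 3 words in total, so:
TF("this", Doc1) = 1/3 ≈ 0.33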


2. Document frequency (DF)
The number of documents that contain the word at least once (not the total number of occurrences).

Example:
- "this" appears in all 4 documents → DF = 4
- "book", "chair", "car", and "toy" each appear in 1 document → DF = 1



3. Inverse document frequency (IDF)

Total number of documents: N = 4

IDF = log(N / DF), where DF is the number of documents that contain the word

- "this": log(4/4) = log(1) = 0 → no weight, the word carries no importance
- "book", "chair", "car": log(4/1) ≈ 1.386 → the word carries some weight


4. TF-IDF

For the word "this" in the first document:

TF("this") = 1/3 ≈ 0.33

IDF("this") = 0

TF-IDF("this") = 0.33 × 0 = 0

For comparison, "book" in the same document has TF = 1/3 and IDF = log(4/1) ≈ 1.386, so TF-IDF("book") ≈ 0.33 × 1.386 ≈ 0.46.


# Inference:
- The resulting TF-IDF vectors can be passed to machine learning models for further processing.
- Examples: text classification or clustering.

