# Natural Language Processing

### Understanding TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure used in Natural Language Processing (NLP) to determine the importance of a word (term) in a document relative to a collection of documents (corpus). It helps identify words that are important for distinguishing one document from others.

#### Formula:
The TF-IDF score for a term \(x\) in a document \(y\) is calculated as:

\[
W_{x,y} = TF_{x,y} \cdot \log\left(\frac{N}{DF_x}\right)
\]

Where:
- \(W_{x,y}\): TF-IDF score of term \(x\) in document \(y\)
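- \(TF_{x,y}\): **Term Frequency**, how often the term \(x\) appears in document \(y\), as a share of the document's total terms
- \(N\): Total number of documents in the corpus
- \(DF_x\): **Document Frequency**, the number of documents containing the term \(x\)
- \(\log\): the logarithm; its base (natural log and base 10 are both common) is a convention that only rescales the scores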
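
As a worked illustration of the formula, suppose (hypothetical numbers, not from any real corpus) that term \(x\) appears 3 times in a 100-term document \(y\), and that 10 of the \(N = 1000\) documents contain it. Assuming the natural logarithm and a term frequency normalized by document length:

\[
TF_{x,y} = \frac{3}{100} = 0.03, \qquad \log\left(\frac{N}{DF_x}\right) = \ln\left(\frac{1000}{10}\right) = \ln 100 \approx 4.605
\]

\[
W_{x,y} \approx 0.03 \times 4.605 \approx 0.138
\]

With a base-10 logarithm the same inputs would give \(0.03 \times 2 = 0.06\). The ranking of terms is unchanged; only the scale differs.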

---

#### Components:
1. **Term Frequency (TF):**
   Measures how often the term occurs in the document, normalized by the document's length so that long and short documents are comparable:
   \[
   TF_{x,y} = \frac{\text{Frequency of } x \text{ in } y}{\text{Total terms in document } y}
   \]

2. **Inverse Document Frequency (IDF):**
   Penalizes common terms that appear across many documents. Rarer terms are given higher weight:
   \[
   IDF_x = \log\left(\frac{N}{DF_x}\right)
   \]
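
The two components can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the toy corpus, the whitespace tokenization, the natural-log base, and the choice to return 0 for a term that occurs in no document are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """Share of the tokens in one document that equal `term`."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """log(N / DF) using the natural logarithm."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    if df == 0:
        # Assumption: the formula is undefined for DF = 0, so give zero weight.
        return 0.0
    return math.log(n_docs / df)


def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of the two components for one term in one document."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(
        term, corpus_tokens
    )


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "bird"):
    print(term, round(inverse_document_frequency(term, corpus), 3))
print("bird in doc 3:", round(tf_idf("bird", corpus[2], corpus), 3))
```

Here "the" occurs in every document, so its IDF is \(\log(3/3) = 0\), while "bird" occurs in only one and receives the largest IDF in this corpus.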

---

#### Why This Formula Works:
1. **TF measures local importance**:
   Words that occur frequently in a document are likely important for that specific document.

2. **IDF measures global rarity**:
   Common words (e.g., "the", "is", "and") appear in many documents and are less useful for distinguishing between them. IDF shrinks their weight, and the more documents a word appears in, the smaller its IDF becomes.

3. **Combining TF and IDF**:
   Multiplying \(TF\) and \(IDF\) ensures that a word is considered important only if it is frequent in a specific document and rare across the entire corpus.
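
The limiting cases of the formula make this concrete. They follow directly from \(W_{x,y} = TF_{x,y} \cdot \log(N / DF_x)\), assuming the unsmoothed form given above:

\[
DF_x = N \;\Rightarrow\; IDF_x = \log(1) = 0 \;\Rightarrow\; W_{x,y} = 0 \text{ for any } TF_{x,y}
\]

\[
DF_x = 1 \;\Rightarrow\; IDF_x = \log(N) \quad \text{(the largest possible value)}
\]

\[
TF_{x,y} = 0 \;\Rightarrow\; W_{x,y} = 0 \text{ for any } IDF_x
\]

A term that appears in every document scores zero however often it occurs, and a rare term scores zero in any document that does not contain it.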

---

#### Intuitive Example:
Consider two words in a corpus of movie reviews: **"great"** and **"movie"**.
- "great" appears frequently in a few specific reviews → **high TF, high IDF → important**.
- "movie" appears in almost all reviews → **high TF, low IDF → less important**.

TF-IDF assigns "great" a higher weight because it is concentrated in a few reviews, while "movie" gets a lower weight because it appears in nearly all of them and so does little to tell one review from another.
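
The comparison can be reproduced on a tiny made-up corpus. The four reviews below are hypothetical, the helper `tf_idf` is written for this example only, and the natural logarithm is an assumed choice of base.

```python
import math

# Hypothetical reviews: "movie" is in all four, "great" only in the first.
reviews = [
    "great movie with a great cast".split(),
    "boring movie".split(),
    "the movie was long".split(),
    "a slow movie".split(),
]


def tf_idf(term, doc, corpus):
    """TF (normalized by document length) times log(N / DF)."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in corpus)
    # Assumption: a term found in no document gets zero weight.
    return tf * math.log(len(corpus) / df) if df else 0.0


great_score = tf_idf("great", reviews[0], reviews)
movie_score = tf_idf("movie", reviews[0], reviews)

print(f"great: {great_score:.3f}")
print(f"movie: {movie_score:.3f}")
```

In this corpus "movie" appears in every review, so its IDF is \(\log(4/4) = 0\) and its score in the first review is zero, while "great" scores \(\frac{2}{6} \cdot \log 4\).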

---

TF-IDF is widely used in:
- Keyword extraction
- Text classification
- Document similarity
- Information retrieval


```python
import numpy as np
import pandas as pd
```