# 06. TF-IDF Ranking

## Table of Contents
1. [Introduction](#introduction)
2. [Theory: TF-IDF](#theory)
3. [Computing TF-IDF](#computing)
4. [TF-IDF Retrieval](#retrieval)
5. [Comparison with TF](#comparison)
6. [Summary](#summary)

---

## 1. Introduction <a name="introduction"></a>

**TF-IDF** (Term Frequency - Inverse Document Frequency) is one of the most widely used term weighting schemes in Information Retrieval.

### The Problem:
Raw term frequency weights every word the same way, regardless of how widely it is used across the collection. But:
- **Common words** (देश "country", राष्ट्र "nation") appear in many documents → Less distinctive
- **Rare words** (सगरमाथा "Sagarmatha/Everest", हिमाल "mountain") appear in few documents → More distinctive

**TF-IDF Solution:** Give higher weight to rare, discriminative terms!

---

## 2. Theory: TF-IDF <a name="theory"></a>

### Formula:

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

### 1. Term Frequency (TF):
How often term $t$ appears in document $d$.

$$
\text{TF}(t, d) = \text{count of } t \text{ in } d
$$

**Variants:**
- Raw count: $\text{freq}(t, d)$
- Log normalization: $1 + \log(\text{freq}(t, d))$ ← We'll use this (with the natural logarithm)
- Boolean: $1$ if present, $0$ otherwise
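
To make the three variants concrete, here is a minimal sketch that evaluates each one for a few raw counts. The function names `tf_raw`, `tf_log`, and `tf_boolean` are illustrative only and are not used elsewhere in this notebook.

```python
import math

def tf_raw(freq):
    """Raw count: the weight grows linearly with the number of occurrences."""
    return freq

def tf_log(freq):
    """Log normalization: 1 + ln(freq) for freq > 0, else 0."""
    return 1 + math.log(freq) if freq > 0 else 0.0

def tf_boolean(freq):
    """Boolean: only records whether the term is present."""
    return 1 if freq > 0 else 0

for freq in (0, 1, 2, 10, 100):
    print(f"freq={freq:<4} raw={tf_raw(freq):<4} "
          f"log={tf_log(freq):.3f} boolean={tf_boolean(freq)}")
```

The log variant sits between the other two: it still rewards repeated terms, but a term that occurs 100 times gets a weight of about 5.6 rather than 100.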

### 2. Inverse Document Frequency (IDF):
How rare the term is across all documents.

$$
\text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)
$$

Where:
- $N$ = Total number of documents
- $\text{df}(t)$ = Number of documents containing term $t$

### Intuition:

```
Term: "नेपाल"  → appears in 8/10 docs → IDF = log(10/8) = 0.22 (low)
Term: "सगरमाथा" → appears in 1/10 docs → IDF = log(10/1) = 2.30 (high)
```

**Result:** Rare terms get higher weights!
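
One consequence of the formula is worth spelling out: a term that occurs in every document gets $\log(N/N) = 0$, so it contributes nothing to any score. Later in this notebook नेपाल appears in all 10 documents and ends up with exactly this zero weight. The sketch below contrasts the plain IDF with one common smoothed variant, $\log\frac{1+N}{1+\text{df}(t)} + 1$. The smoothed form is shown only for comparison; it is an assumption of this example and is not what the rest of the notebook computes.

```python
import math

def idf_plain(df, n_docs):
    """IDF as defined above: log(N / df)."""
    return math.log(n_docs / df)

def idf_smoothed(df, n_docs):
    """A smoothed variant: log((1 + N) / (1 + df)) + 1, never below 1 for df <= N."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 10
for df in (1, 3, 10):
    print(f"df={df:<3} plain={idf_plain(df, n_docs):.4f} "
          f"smoothed={idf_smoothed(df, n_docs):.4f}")
```

With the plain form, a query term found in every document is effectively ignored; with the smoothed form it keeps a small positive weight. Which behaviour is preferable depends on the collection.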

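### Why Logarithm?
- **Compressed range**: $N/\text{df}(t)$ can span several orders of magnitude; the logarithm keeps a single very rare term from dominating the score
- **Diminishing returns**: Each extra occurrence adds less. With $1 + \log(\text{freq})$, going from 1 to 2 occurrences adds about 0.69, while going from 10 to 11 adds only about 0.10
- **Empirical support**: Log-scaled weights have long performed well in retrieval experiments (see the references in the Summary)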
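
The effect of the logarithm can be checked numerically. The sketch below prints the increment in $1 + \log(\text{freq})$ for one extra occurrence at low and high counts, and then shows how $\log(N/\text{df})$ compresses a wide range of ratios. The collection size of one million documents is a hypothetical figure chosen only to make the range visible.

```python
import math

def log_tf(freq):
    """Log-normalized term frequency for freq >= 1."""
    return 1 + math.log(freq)

# TF side: the gain from one extra occurrence shrinks as the count grows.
step_low = log_tf(2) - log_tf(1)      # 1 -> 2 occurrences
step_high = log_tf(11) - log_tf(10)   # 10 -> 11 occurrences
print(f"1 -> 2:   +{step_low:.3f}")
print(f"10 -> 11: +{step_high:.3f}")

# IDF side: ratios spanning six orders of magnitude map to a narrow range.
n_docs = 1_000_000  # hypothetical collection size
for df in (1, 1_000, 1_000_000):
    ratio = n_docs / df
    print(f"df={df:<9} N/df={ratio:<10.0f} log(N/df)={math.log(ratio):.2f}")
```

Note that doubling a count always adds the same amount, $\log 2$, so the diminishing returns apply per additional occurrence, not per doubling.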

---

## 3. Computing TF-IDF <a name="computing"></a>

In [1]:
from pathlib import Path
from collections import Counter
import math

# Load data
DATA_DIR = Path('../data')

def load_documents(data_dir):
    documents = {}
    for file_path in sorted(data_dir.glob('doc*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            documents[file_path.stem] = f.read()
    return documents

def load_stopwords(file_path):
    stopwords = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f)
        for line in f:
            stopwords.add(line.strip())
    return stopwords

def load_stemming_dict(file_path):
    stem_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f)
        for line in f:
            parts = line.strip().split(',')
            if len(parts) == 2:
                stem_dict[parts[0]] = parts[1]
    return stem_dict

def tokenize(text):
    tokens = text.split()
    cleaned = []
    for token in tokens:
        token = token.strip('।,.!?;:"\'-()[]{}/')
        if token and any('\u0900' <= c <= '\u097F' for c in token):
            cleaned.append(token)
    return cleaned

def preprocess_text(text, stopwords, stem_dict):
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in stopwords]
    tokens = [stem_dict.get(t, t) for t in tokens]
    return tokens

documents = load_documents(DATA_DIR)
stopwords = load_stopwords(DATA_DIR / 'nepali_stopwords.csv')
stem_dict = load_stemming_dict(DATA_DIR / 'nepali_stemming.csv')

preprocessed_docs = {}
for doc_id, text in documents.items():
    preprocessed_docs[doc_id] = preprocess_text(text, stopwords, stem_dict)

print(f"✓ Loaded {len(preprocessed_docs)} documents")

✓ Loaded 10 documents


In [2]:
def compute_document_frequency(preprocessed_docs):
    """
    Compute document frequency for each term.
    
    DF(term) = number of documents containing the term
    
    Returns:
    --------
    dict : term → document frequency
    """
    df = {}
    
    for terms in preprocessed_docs.values():
        # Get unique terms in this document
        unique_terms = set(terms)
        
        # Increment DF for each unique term
        for term in unique_terms:
            df[term] = df.get(term, 0) + 1
    
    return df

def compute_idf(df, num_documents):
    """
    Compute IDF for each term.
    
    IDF(term) = log(N / DF(term))
    
    Parameters:
    -----------
    df : dict
        Document frequency for each term
    num_documents : int
        Total number of documents
    
    Returns:
    --------
    dict : term → IDF value
    """
    idf = {}
    
    for term, doc_freq in df.items():
        # IDF = log(N / df)
        idf[term] = math.log(num_documents / doc_freq)
    
    return idf

# Compute DF and IDF
df = compute_document_frequency(preprocessed_docs)
idf = compute_idf(df, len(preprocessed_docs))

print(f"✓ Computed IDF for {len(idf)} terms")

✓ Computed IDF for 398 terms


In [3]:
# Examine IDF values
def show_idf_examples(idf, df, num_docs):
    """
    Display IDF values for terms with different rarities.
    """
    # Sort terms by IDF (high to low)
    sorted_terms = sorted(idf.items(), key=lambda x: x[1], reverse=True)
    
    print("\n📊 IDF Values (High = Rare, Low = Common):")
    print("="*80)
    print(f"{'Term':<20} {'DF (# docs)':<15} {'IDF Value':<15} {'Category'}")
    print("="*80)
    
    # Show top 5 (rarest)
    print("\n🔹 Rare Terms (High IDF):")
    for term, idf_val in sorted_terms[:5]:
        doc_freq = df[term]
        print(f"{term:<20} {doc_freq:<15} {idf_val:<15.4f} Very distinctive")
    
    # Show bottom 5 (most common)
    print("\n🔸 Common Terms (Low IDF):")
    for term, idf_val in sorted_terms[-5:]:
        doc_freq = df[term]
        print(f"{term:<20} {doc_freq:<15} {idf_val:<15.4f} Less distinctive")
    
    print("="*80)

show_idf_examples(idf, df, len(preprocessed_docs))


📊 IDF Values (High = Rare, Low = Common):
Term                 DF (# docs)     IDF Value       Category

🔹 Rare Terms (High IDF):
विश्वभर              1               2.3026          Very distinctive
नेपाललाई             1               2.3026          Very distinctive
चाडपर्वहरू           1               2.3026          Very distinctive
शाहले                1               2.3026          Very distinctive
१७६८                 1               2.3026          Very distinctive

🔸 Common Terms (Low IDF):
भूमिका               3               1.2040          Less distinctive
विकास                4               0.9163          Less distinctive
राष्ट्रिय            4               0.9163          Less distinctive
महत्वपूर्ण           5               0.6931          Less distinctive
नेपाल                10              0.0000          Less distinctive

In [4]:
def compute_tf_log_normalized(term_freq):
    """
    Compute log-normalized TF.
    
    TF = 1 + log(freq) if freq > 0, else 0
    """
    if term_freq == 0:
        return 0
    return 1 + math.log(term_freq)

def build_tfidf_vectors(preprocessed_docs, vocabulary, idf):
    """
    Build TF-IDF weighted document vectors.
    
    For each document and term:
    weight = (1 + log(TF)) × IDF
    
    Parameters:
    -----------
    preprocessed_docs : dict
        Document ID → list of terms
    vocabulary : list
        Sorted list of all unique terms
    idf : dict
        Term → IDF value
    
    Returns:
    --------
    dict : Document ID → TF-IDF vector
    """
    vectors = {}
    term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}
    
    for doc_id, terms in preprocessed_docs.items():
        # Count term frequencies
        term_counts = Counter(terms)
        
        # Build TF-IDF vector
        vector = [0.0] * len(vocabulary)
        
        for term, freq in term_counts.items():
            idx = term_to_idx[term]
            
            # TF component (log normalized)
            tf = compute_tf_log_normalized(freq)
            
            # TF-IDF = TF × IDF
            tfidf = tf * idf[term]
            
            vector[idx] = tfidf
        
        vectors[doc_id] = vector
    
    return vectors

# Build vocabulary and TF-IDF vectors
vocabulary = sorted(set(term for terms in preprocessed_docs.values() for term in terms))
tfidf_vectors = build_tfidf_vectors(preprocessed_docs, vocabulary, idf)

print(f"\n✓ Built TF-IDF vectors")
print(f"  Vocabulary size: {len(vocabulary)}")
print(f"  Number of vectors: {len(tfidf_vectors)}")


✓ Built TF-IDF vectors
  Vocabulary size: 398
  Number of vectors: 10


In [5]:
# Compare TF vs TF-IDF weights for a sample document
def compare_tf_tfidf(doc_id, preprocessed_docs, vocabulary, idf, num_terms=10):
    """
    Compare TF and TF-IDF weights for terms in a document.
    """
    terms = preprocessed_docs[doc_id]
    term_counts = Counter(terms)
    
    # Compute weights
    weights = []
    for term, freq in term_counts.items():
        tf = freq
        tf_log = compute_tf_log_normalized(freq)
        tfidf = tf_log * idf[term]
        weights.append((term, tf, tf_log, idf[term], tfidf))
    
    # Sort by TF-IDF (descending)
    weights.sort(key=lambda x: x[4], reverse=True)
    
    print(f"\n📄 Document: {doc_id}")
    print(f"Title: {documents[doc_id].split(chr(10))[0]}")
    print("="*80)
    print(f"{'Term':<15} {'TF':<8} {'TF(log)':<10} {'IDF':<10} {'TF-IDF'}")
    print("="*80)
    
    for term, tf, tf_log, idf_val, tfidf in weights[:num_terms]:
        print(f"{term:<15} {tf:<8} {tf_log:<10.3f} {idf_val:<10.3f} {tfidf:.3f}")
    
    print("="*80)

# Example for doc02 (about Himalayas and tourism)
compare_tf_tfidf('doc02', preprocessed_docs, vocabulary, idf)


📄 Document: doc02
Title: हिमाल र पर्यटन
Term            TF       TF(log)    IDF        TF-IDF
पर्यटक          4        2.386      2.303      5.495
आउँछन्          2        1.693      2.303      3.899
स्तूप           2        1.693      2.303      3.899
हिमाल           5        2.609      1.204      3.142
विश्वको         1        1.000      2.303      2.303
अग्लो           1        1.000      2.303      2.303
सगरमाथाको       1        1.000      2.303      2.303
उचाइ            1        1.000      2.303      2.303
८,८४८.८६        1        1.000      2.303      2.303
मिटर            1        1.000      2.303      2.303
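
The table also shows why the most frequent term is not the top-weighted one: हिमाल occurs 5 times in doc02 but appears in several documents, while पर्यटक occurs 4 times and only in this document. As a check, the sketch below recomputes three rows by hand. The document frequencies (1, 1, and 3) are inferred from the IDF column with $N = 10$ and are not read from the corpus.

```python
import math

N = 10  # number of documents, from the "Loaded 10 documents" output

# term -> (raw frequency in doc02, document frequency inferred from the IDF column)
rows = {
    "पर्यटक": (4, 1),
    "स्तूप": (2, 1),
    "हिमाल": (5, 3),
}

def tfidf(freq, df, n_docs=N):
    """Log-normalized TF times plain IDF, as used in this notebook."""
    return (1 + math.log(freq)) * math.log(n_docs / df)

for term, (freq, df) in rows.items():
    print(f"{term}: {tfidf(freq, df):.3f}")
```

The three values reproduce the TF-IDF column above (5.495, 3.899, and 3.142).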


---

## 4. TF-IDF Retrieval <a name="retrieval"></a>
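
The retrieval code in this section scores each document by the cosine similarity between the query vector and the document vector. Cosine similarity compares direction and ignores length, so a long document is not favoured just because its weights are larger. The toy vectors below are made up for illustration and are not taken from the corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = [1.0, 1.0, 0.0]
doc_short = [2.0, 1.0, 0.0]
doc_long = [4.0, 2.0, 0.0]    # same direction as doc_short, twice the length
doc_other = [0.0, 0.0, 3.0]   # shares no terms with the query

for name, vec in [("doc_short", doc_short), ("doc_long", doc_long), ("doc_other", doc_other)]:
    print(f"{name}: {cosine(query, vec):.4f}")
```

`doc_short` and `doc_long` receive the same score, and `doc_other` scores zero because it has no term in common with the query.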

In [6]:
def dot_product(vec1, vec2):
    return sum(v1 * v2 for v1, v2 in zip(vec1, vec2))

def vector_magnitude(vec):
    return math.sqrt(sum(v ** 2 for v in vec))

def cosine_similarity(vec1, vec2):
    dot_prod = dot_product(vec1, vec2)
    mag1 = vector_magnitude(vec1)
    mag2 = vector_magnitude(vec2)
    
    if mag1 == 0 or mag2 == 0:
        return 0.0
    
    return dot_prod / (mag1 * mag2)

def query_to_tfidf_vector(query_text, vocabulary, stopwords, stem_dict, idf):
    """
    Convert query to TF-IDF vector.
    """
    # Preprocess query
    query_terms = preprocess_text(query_text, stopwords, stem_dict)
    term_counts = Counter(query_terms)
    
    # Build TF-IDF vector
    term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    
    for term, freq in term_counts.items():
        if term in term_to_idx:
            idx = term_to_idx[term]
            tf = compute_tf_log_normalized(freq)
            tfidf = tf * idf.get(term, 0)
            vector[idx] = tfidf
    
    return vector

def tfidf_ranked_retrieval(query_text, tfidf_vectors, vocabulary, stopwords, stem_dict, idf, top_k=5):
    """
    Retrieve and rank documents using TF-IDF weighting.
    """
    # Convert query to TF-IDF vector
    query_vector = query_to_tfidf_vector(query_text, vocabulary, stopwords, stem_dict, idf)
    
    # Calculate similarity scores
    scores = []
    for doc_id, doc_vector in tfidf_vectors.items():
        score = cosine_similarity(query_vector, doc_vector)
        scores.append((doc_id, score))
    
    # Sort by score (descending)
    scores.sort(key=lambda x: x[1], reverse=True)
    
    return scores[:top_k]

print("✓ TF-IDF retrieval functions defined")

✓ TF-IDF retrieval functions defined


In [7]:
# Test Query 1
query1 = "नेपाल हिमाल सगरमाथा"
print(f"\n🔍 Query: '{query1}'")
print("="*70)

results = tfidf_ranked_retrieval(query1, tfidf_vectors, vocabulary, stopwords, stem_dict, idf, top_k=5)

print(f"\n{'Rank':<6} {'Doc ID':<10} {'Score':<10} {'Title'}")
print("="*70)

for rank, (doc_id, score) in enumerate(results, 1):
    title = documents[doc_id].split('\n')[0]
    print(f"{rank:<6} {doc_id:<10} {score:.4f}    {title}")

print("="*70)


🔍 Query: 'नेपाल हिमाल सगरमाथा'

Rank   Doc ID     Score      Title
1      doc02      0.1942    हिमाल र पर्यटन
2      doc09      0.1404    वातावरण र जलवायु
3      doc01      0.0436    नेपालको इतिहास र संस्कृति
4      doc03      0.0000    शिक्षा र प्रविधि
5      doc04      0.0000    कृषि र अर्थतन्त्र


In [8]:
# Test Query 2
query2 = "विश्वविद्यालय प्रविधि डिजिटल"
print(f"\n🔍 Query: '{query2}'")
print("="*70)

results = tfidf_ranked_retrieval(query2, tfidf_vectors, vocabulary, stopwords, stem_dict, idf, top_k=5)

print(f"\n{'Rank':<6} {'Doc ID':<10} {'Score':<10} {'Title'}")
print("="*70)

for rank, (doc_id, score) in enumerate(results, 1):
    title = documents[doc_id].split('\n')[0]
    print(f"{rank:<6} {doc_id:<10} {score:.4f}    {title}")

print("="*70)


🔍 Query: 'विश्वविद्यालय प्रविधि डिजिटल'

Rank   Doc ID     Score      Title
1      doc03      0.4482    शिक्षा र प्रविधि
2      doc07      0.0498    स्वास्थ्य सेवा
3      doc08      0.0460    यातायात र सञ्चार
4      doc01      0.0000    नेपालको इतिहास र संस्कृति
5      doc02      0.0000    हिमाल र पर्यटन


---

## 5. Comparison with TF <a name="comparison"></a>

Let's compare TF-only vs TF-IDF ranking.
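
Before running the comparison on the corpus, a toy illustration may help show how the two schemes can disagree. The three documents and the query below are invented for this sketch, and sparse dictionaries stand in for the dense vectors used elsewhere in the notebook. In the toy collection "nepal" occurs in every document, so its IDF is zero and only "leopard" can separate the documents under TF-IDF.

```python
import math
from collections import Counter

# Hypothetical toy collection (not the notebook's corpus)
docs = {
    "d1": ["nepal", "nepal", "nepal"],
    "d2": ["nepal", "leopard", "forest", "snow", "cat", "habitat"],
    "d3": ["nepal", "river"],
}
query = ["nepal", "leopard"]

n_docs = len(docs)
df = Counter(term for terms in docs.values() for term in set(terms))
idf = {term: math.log(n_docs / count) for term, count in df.items()}

def tf_weights(terms):
    """Raw term counts as a sparse vector."""
    return dict(Counter(terms))

def tfidf_weights(terms):
    """(1 + ln freq) * IDF as a sparse vector; unknown terms get weight 0."""
    return {t: (1 + math.log(c)) * idf.get(t, 0.0) for t, c in Counter(terms).items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(weight_fn):
    """Return document IDs ordered by descending similarity to the query."""
    q = weight_fn(query)
    scores = {doc_id: cosine(q, weight_fn(terms)) for doc_id, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

print("TF-only:", rank(tf_weights))
print("TF-IDF: ", rank(tfidf_weights))
```

Under raw TF, `d1` ranks first because it repeats "nepal", even though it never mentions "leopard". Under TF-IDF, `d2` ranks first because it is the only document that contains the rare query term.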

In [9]:
# Build simple TF vectors for comparison
def build_tf_vectors(preprocessed_docs, vocabulary):
    """Build vectors with raw term frequency."""
    vectors = {}
    term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}
    
    for doc_id, terms in preprocessed_docs.items():
        term_counts = Counter(terms)
        vector = [0] * len(vocabulary)
        
        for term, count in term_counts.items():
            idx = term_to_idx[term]
            vector[idx] = count
        
        vectors[doc_id] = vector
    
    return vectors

tf_vectors = build_tf_vectors(preprocessed_docs, vocabulary)

def tf_ranked_retrieval(query_text, tf_vectors, vocabulary, stopwords, stem_dict, top_k=5):
    """Retrieve using simple TF weighting."""
    query_terms = preprocess_text(query_text, stopwords, stem_dict)
    term_counts = Counter(query_terms)
    
    term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}
    query_vector = [0] * len(vocabulary)
    
    for term, count in term_counts.items():
        if term in term_to_idx:
            query_vector[term_to_idx[term]] = count
    
    scores = []
    for doc_id, doc_vector in tf_vectors.items():
        score = cosine_similarity(query_vector, doc_vector)
        scores.append((doc_id, score))
    
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

# Compare on the same query
query = "विशिष्ट हिमाल चितुवा"

print(f"\n🔍 Comparison Query: '{query}'")
print("\nThis query contains rare terms that should be weighted heavily.\n")
print("="*70)

# TF-only results
print("\n📊 TF-only Ranking:")
tf_results = tf_ranked_retrieval(query, tf_vectors, vocabulary, stopwords, stem_dict, top_k=5)
for rank, (doc_id, score) in enumerate(tf_results, 1):
    title = documents[doc_id].split('\n')[0]
    print(f"  {rank}. {doc_id} ({score:.4f}) - {title}")

# TF-IDF results
print("\n📊 TF-IDF Ranking:")
tfidf_results = tfidf_ranked_retrieval(query, tfidf_vectors, vocabulary, stopwords, stem_dict, idf, top_k=5)
for rank, (doc_id, score) in enumerate(tfidf_results, 1):
    title = documents[doc_id].split('\n')[0]
    print(f"  {rank}. {doc_id} ({score:.4f}) - {title}")

print("="*70)
print("\n💡 TF-IDF gives higher weight to rare, discriminative terms!")


🔍 Comparison Query: 'विशिष्ट हिमाल चितुवा'

This query contains rare terms that should be weighted heavily.


📊 TF-only Ranking:
  1. doc02 (0.3418) - हिमाल र पर्यटन
  2. doc09 (0.1846) - वातावरण र जलवायु
  3. doc01 (0.0671) - नेपालको इतिहास र संस्कृति
  4. doc03 (0.0000) - शिक्षा र प्रविधि
  5. doc04 (0.0000) - कृषि र अर्थतन्त्र

📊 TF-IDF Ranking:
  1. doc09 (0.1670) - वातावरण र जलवायु
  2. doc02 (0.0892) - हिमाल र पर्यटन
  3. doc01 (0.0338) - नेपालको इतिहास र संस्कृति
  4. doc03 (0.0000) - शिक्षा र प्रविधि
  5. doc04 (0.0000) - कृषि र अर्थतन्त्र

💡 TF-IDF gives higher weight to rare, discriminative terms!


---

## 6. Summary <a name="summary"></a>

### What We Learned:

1. **TF-IDF Formula**
   - Combines term frequency and inverse document frequency
   - TF-IDF = (1 + log TF) × log(N / DF)
   - Balances term importance and rarity

2. **IDF Component**
   - Rare terms get high IDF (high weight)
   - Common terms get low IDF (low weight)
   - log(N/DF) smooths the values

3. **TF Component**
   - Measures term importance within document
   - Log normalization prevents domination by high-frequency terms
   - 1 + log(freq) is standard

4. **Ranking Quality**
   - Often ranks better than raw TF, because matches on widely used terms count for less
   - Emphasizes distinctive terms
   - A long-standing baseline weighting scheme in IR

### Key Insights:

**Why TF-IDF Works:**
- ✓ **Discriminative**: Rare terms are more informative
- ✓ **Balanced**: Combines local (TF) and global (IDF) statistics
- ✓ **Robust**: Works well across different domains
- ✓ **Efficient**: Simple to compute and understand

**Example:**
```
Query: "नेपाल हिमाल सगरमाथा"

- "सगरमाथा" (Sagarmatha/Everest): Rare → High IDF → High weight
- "हिमाल" (Mountain): More common → Lower IDF → Lower weight
- "नेपाल" (Nepal): Very common → Very low IDF → Very low weight

Result: Documents about Sagarmatha ranked highest!
```

### Real-World Usage:
- **Early web search engines**: Used TF-IDF-style term weighting as one ranking signal
- **Lucene/Elasticsearch**: TF-IDF was the default scoring until Lucene 6.0 and Elasticsearch 5.0, which switched the default to BM25
- **Text Mining**: Feature extraction for ML
- **Document Clustering**: Similarity computation
### Next Steps:
In the next notebook (`07_language_modeling.ipynb`), we will:
- Explore probabilistic IR models
- Learn language modeling approach
- Compare with TF-IDF
- Understand smoothing techniques

### Research References:
- Salton & Buckley (1988): "Term-weighting approaches in automatic text retrieval"
- Manning et al., "Introduction to Information Retrieval", Chapter 6

TF-IDF is among the most widely cited weighting schemes in IR, and its TF and IDF components carry over into later ranking functions such as BM25.