# TF-IDF: A Friendly Guide to Understanding Text Importance

## What is TF-IDF?
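Imagine you're trying to understand how important words are in a collection of documents. TF-IDF (Term Frequency-Inverse Document Frequency) is like a detective that helps you figure out which words are truly meaningful: a word scores highly when it appears often in one document but rarely in the others.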

## The Two-Step Magic: TF and IDF

### Step 1: Term Frequency (TF) - How Often Words Appear
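Think of Term Frequency like counting how many times a word shows up in a single document. The count is often divided by the document's total number of words, so that long and short documents can be compared fairly.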

Example:
- Document: "The cat sat on the mat. The cat is fluffy."
- Word counts (ignoring case and punctuation):
  - "the": 3 times
  - "cat": 2 times
  - "sat": 1 time
  - "on": 1 time
  - "mat": 1 time
  - "is": 1 time
  - "fluffy": 1 time
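
The counts above can be reproduced with a few lines of Python. This is only a minimal sketch: `count_words` is a hypothetical helper name, and the regular expression is one simple way to drop punctuation, assuming plain English text.

```python
import re
from collections import Counter

def count_words(text):
    # Lowercase the text, then keep only runs of letters and apostrophes,
    # so that "mat." is counted as "mat" and "The" as "the".
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

counts = count_words("The cat sat on the mat. The cat is fluffy.")
print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```

Dividing each count by the total number of words (10 here) turns raw counts into relative frequencies.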

### Step 2: Inverse Document Frequency (IDF) - How Unique a Word Is
IDF helps us understand how special or rare a word is across ALL documents.

Imagine you have three documents:
1. "The cat is fluffy"
2. "The dog is playful"
3. "The bird flies high"

- "the" appears in ALL three documents (common word)
- "is" appears in TWO documents (fairly common)
- "cat" appears in only ONE document (rare word)
- "fluffy" appears in only ONE document (just as rare as "cat")

IDF gives more weight to unique words and less weight to common words.
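
To make this concrete, here is the arithmetic for the three documents above. It assumes the common definition of IDF as the natural logarithm of (total documents ÷ documents containing the word); other variants exist, for example with smoothing terms added.

$$
\begin{aligned}
\mathrm{idf}(\text{the}) &= \ln\frac{3}{3} = 0 \\
\mathrm{idf}(\text{is}) &= \ln\frac{3}{2} \approx 0.405 \\
\mathrm{idf}(\text{cat}) &= \mathrm{idf}(\text{fluffy}) = \ln\frac{3}{1} \approx 1.099
\end{aligned}
$$

A word found in every document gets an IDF of exactly zero under this definition, while the rarest words get the largest values.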

### Putting TF and IDF Together

The TF-IDF score is calculated by multiplying Term Frequency (TF) and Inverse Document Frequency (IDF).

```text
TF-IDF = (Times the word appears in a document ÷ Total words in that document)
         × log(Total documents ÷ Number of documents containing the word)
```

```python
import math

def calculate_tf(document):
    # Count word frequencies in a single document
    word_counts = {}
    words = document.lower().split()
    for word in words:
        # Increment word count
        word_counts[word] = word_counts.get(word, 0) + 1

    # Calculate Term Frequency (count divided by total words in the document)
    total_words = len(words)
    tf = {word: count / total_words for word, count in word_counts.items()}
    return tf
```

```python
def calculate_idf(documents):
    # Count in how many documents each word appears
    word_doc_count = {}
    total_docs = len(documents)

    for document in documents:
        unique_words = set(document.lower().split())
        for word in unique_words:
            word_doc_count[word] = word_doc_count.get(word, 0) + 1

    # Calculate Inverse Document Frequency (natural log)
    idf = {word: math.log(total_docs / count) for word, count in word_doc_count.items()}
    return idf
```

```python
def calculate_tfidf(documents):
    # Calculate TF-IDF for all documents
    tfidf_results = []
    idf = calculate_idf(documents)

    for document in documents:
        tf = calculate_tf(document)

        # Combine TF and IDF
        doc_tfidf = {word: tf_score * idf.get(word, 0)
                     for word, tf_score in tf.items()}
        tfidf_results.append(doc_tfidf)

    return tfidf_results
```

```python
# Example usage
documents = [
    "The cat is fluffy",
    "The dog is playful",
    "The bird flies high"
]

tfidf_scores = calculate_tfidf(documents)
for i, scores in enumerate(tfidf_scores):
    print(f"Document {i+1} TF-IDF Scores:")
    for word, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
        print(f"{word}: {score:.4f}")
```

```text
Document 1 TF-IDF Scores:
cat: 0.2747
fluffy: 0.2747
is: 0.1014
the: 0.0000
Document 2 TF-IDF Scores:
dog: 0.2747
playful: 0.2747
is: 0.1014
the: 0.0000
Document 3 TF-IDF Scores:
bird: 0.2747
flies: 0.2747
high: 0.2747
the: 0.0000
```
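
These numbers can be checked by hand. Each document has four words, so every word has a term frequency of 1/4. The worked arithmetic for Document 1, assuming the natural logarithm that `math.log` uses by default, looks like this:

$$
\begin{aligned}
\text{tf-idf}(\text{cat}) &= \tfrac{1}{4} \times \ln\tfrac{3}{1} \approx 0.2747 \\
\text{tf-idf}(\text{is}) &= \tfrac{1}{4} \times \ln\tfrac{3}{2} \approx 0.1014 \\
\text{tf-idf}(\text{the}) &= \tfrac{1}{4} \times \ln\tfrac{3}{3} = 0
\end{aligned}
$$

"the" scores zero because it appears in every document, which is exactly the behavior we want for common words.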


## Real-World Applications
TF-IDF is used in:
- Search Engine Ranking
- Document Similarity
- Keyword Extraction
- Recommendation Systems
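
As an illustration of the document similarity use case, one common approach is to compare TF-IDF vectors with cosine similarity. The sketch below is an assumption about how that might look, not a definitive implementation: `cosine_similarity` is a hypothetical helper, and the dictionaries are the rounded scores from the example output earlier in this guide.

```python
import math

def cosine_similarity(vec_a, vec_b):
    # Dot product over the words that appear in both sparse vectors
    dot = sum(weight * vec_b[word] for word, weight in vec_a.items() if word in vec_b)
    # Euclidean length of each vector
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Rounded TF-IDF scores taken from the example output
doc1 = {"cat": 0.2747, "fluffy": 0.2747, "is": 0.1014, "the": 0.0}
doc2 = {"dog": 0.2747, "playful": 0.2747, "is": 0.1014, "the": 0.0}
doc3 = {"bird": 0.2747, "flies": 0.2747, "high": 0.2747, "the": 0.0}

sim_12 = cosine_similarity(doc1, doc2)  # small but positive: both share "is"
sim_13 = cosine_similarity(doc1, doc3)  # zero: the only shared word is "the"
print(sim_12, sim_13)
```

Documents 1 and 2 come out slightly similar because they share the word "is", while Documents 1 and 3 share only "the", which carries no weight.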

## Key Takeaways
1. TF shows how often a word appears in a document
2. IDF shows how unique or rare a word is
3. TF-IDF combines both to find truly important words