# Building TF-IDF from Scratch
In this notebook, we will implement **Term Frequency - Inverse Document Frequency (TF-IDF)** from scratch using Python. 
This is a fundamental technique in Natural Language Processing (NLP) for converting text data into numerical vectors.

### Goals
1. Implement **TF (Term Frequency)**.
2. Implement **IDF (Inverse Document Frequency)**.
3. Combine them to create **TF-IDF**.
4. Compare our results with `sklearn`.


In [None]:
import pandas as pd
import math
import numpy as np

# Sample Corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great"
]
print("Corpus:", corpus)

## 1. Term Frequency (TF)

**Term Frequency** measures how often a term appears in a document, relative to the document's length.

### Formula
$$
TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}
$$
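As a quick illustration of the formula, here is a minimal sketch on a made-up sentence (not part of the notebook's corpus). It uses `collections.Counter` for brevity and assumes simple whitespace tokenization; the exercise below asks for the same idea written with a plain dictionary.

```python
from collections import Counter

# Hypothetical example sentence, chosen only to illustrate the formula.
sentence = "to be or not to be"
tokens = sentence.split()   # whitespace tokenization (an assumption)
counts = Counter(tokens)    # raw count of each term

# TF = count of term / total number of terms in the document
tf_example = {term: count / len(tokens) for term, count in counts.items()}
print(tf_example)
```

Because every count is divided by the same total, the TF values of a single document always sum to 1.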

### Exercise 1
Complete the function `compute_tf` below.


In [None]:
def compute_tf(document):
    """
    Computes TF for a single document (string).
    Returns a dictionary: {term: tf_value}
    """
    # Split the document into words (tokens)
    words = None  # TODO: split the document on whitespace
    total_words = len(words)
    
    # 1. Count the frequency of each word
    word_counts = {}
    for word in words:
        # TODO: Count occurrences of each word
        # word_counts[word] = ...
        pass 

    # 2. Calculate TF for each word
    tf_dict = {}
    
    for word, count in word_counts.items():
        # TODO: Calculate TF = (count of term) / (total terms)
        # tf_dict[word] = ...
        pass
        
    return tf_dict

# Test with the first document
print("TF for doc 0:", compute_tf(corpus[0]))


## 2. Inverse Document Frequency (IDF)

**IDF** measures how informative a term is across the whole corpus. TF treats every term as equally important, whereas IDF down-weights terms that appear in many documents (like "the", "is") and up-weights terms that appear in only a few.

### Formula
$$
IDF(t) = \log \left( \frac{N}{DF(t)} \right)
$$
Where:
* $N$ = Total number of documents.
* $DF(t)$ = Number of documents containing term $t$.

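*Note: You can use `math.log` (natural log) or `math.log10`; the base only rescales every IDF value by the same constant factor. scikit-learn uses the natural log, so `math.log` makes the comparison in Section 4 easier.*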
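To make the effect of the log base concrete, the sketch below evaluates the IDF formula for a hypothetical three-document corpus, for a term found in 1, 2, or all 3 documents. The names `n_docs` and `idf_by_df` are illustrative only.

```python
import math

n_docs = 3  # N: size of a hypothetical three-document corpus

# DF -> (IDF with natural log, IDF with base-10 log)
idf_by_df = {}
for df in (1, 2, 3):
    idf_by_df[df] = (math.log(n_docs / df), math.log10(n_docs / df))
    print(f"DF={df}: ln -> {idf_by_df[df][0]:.4f}, log10 -> {idf_by_df[df][1]:.4f}")
```

A term that occurs in every document gets an IDF of 0 under either base, and the two columns differ only by the constant factor $\ln 10$.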

### Exercise 2
Complete the function `compute_idf`.


In [None]:
def compute_idf(corpus):
    """
    Computes IDF for the entire corpus.
    Returns a dictionary: {term: idf_value}
    """
    N = len(corpus)
    
    all_words_df = {}
    
    for doc in corpus:
        words = None  # TODO: Use set to count a word only once per doc
        
        for word in words:
            # TODO: Increment document frequency count for this word
            # if word in all_words_df: ...
            pass

    idf_dict = {}
    # Calculate IDF for each word from its document frequency
    for word, df_count in all_words_df.items():
        # TODO: Calculate IDF = log(N / df_count)
        # idf_dict[word] = ...
        pass
        
    return idf_dict

# Test it
idf_result = compute_idf(corpus)
print("IDF Result:", idf_result)


## 3. TF-IDF

Now we multiply them together:
$$
TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)
$$
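As a hand check for the exercise below, this sketch applies the formula to document 0 of the sample corpus ("the cat sat on the mat") using counts tallied by hand rather than the functions above. It assumes whitespace tokenization and the natural log; `hand_counts` and `expected` are illustrative names.

```python
import math

n_docs = 3       # documents in the sample corpus
doc_length = 6   # tokens in "the cat sat on the mat"

# term -> (count in document 0, number of documents containing the term)
hand_counts = {
    "the": (2, 2),
    "sat": (1, 2),
    "on": (1, 2),
    "cat": (1, 1),
    "mat": (1, 1),
}

# TF-IDF(t, d) = TF(t, d) * IDF(t), with natural log assumed
expected = {
    term: (count / doc_length) * math.log(n_docs / df)
    for term, (count, df) in hand_counts.items()
}
for term, value in expected.items():
    print(f"{term}: {value:.4f}")
```

If your `compute_tfidf` uses `math.log`, the first row of your matrix should match these values. Note that "cat" scores higher than "the" even though "the" occurs twice, because "the" also appears in another document.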

### Exercise 3
Create the full TF-IDF matrix for our corpus.


In [None]:
def compute_tfidf(corpus):
    idf_dict = compute_idf(corpus)
    vectors = []
    
    for doc in corpus:
        tf_dict = compute_tf(doc)
        
        # Calculate TF-IDF for each word in the doc
        doc_tfidf = {}
        for word, tf_val in tf_dict.items():
            # TODO: Multiply TF * IDF
            # doc_tfidf[word] = ...
            pass
        
        vectors.append(doc_tfidf)
    
    return vectors

# Run the pipeline
tfidf_vectors = compute_tfidf(corpus)

# Display as a DataFrame for better visibility
df = pd.DataFrame(tfidf_vectors)
df = df.fillna(0) # Fill NaN with 0 (words not in document)
print("My TF-IDF Matrix:")
print(df)


## 4. Comparison with Scikit-Learn
Let's see how our results compare with a professional library.
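Note: By default, scikit-learn smooths the IDF and L2-normalizes each row. The cell below turns both off, but even then it computes TF as the raw term count (not divided by document length) and IDF as $\ln(N / DF(t)) + 1$. The numbers therefore won't be identical to ours, though both matrices have zeros in the same places and both weight rarer terms more heavily than common ones with the same count.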
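The difference can be sketched in plain Python. The helper below follows scikit-learn's documented formula for `smooth_idf=False` and `norm=None` (raw count times $\ln(N / DF) + 1$); both function names are hypothetical and exist only for this illustration.

```python
import math

def sklearn_style_weight(count, n_docs, df):
    """Hypothetical helper: raw count times (ln(N / DF) + 1)."""
    return count * (math.log(n_docs / df) + 1)

def scratch_weight(count, doc_length, n_docs, df):
    """Hypothetical helper: this notebook's formulas, with natural log."""
    return (count / doc_length) * math.log(n_docs / df)

# Document 0 of the sample corpus has 6 tokens.
# "the": 2 occurrences, found in 2 of 3 documents
# "cat": 1 occurrence, found in 1 of 3 documents
for term, count, df in [("the", 2, 2), ("cat", 1, 1)]:
    print(term, sklearn_style_weight(count, 3, df), scratch_weight(count, 6, 3, df))
```

Because scikit-learn adds 1 to the IDF and does not divide by document length, a repeated common word can outrank a rarer word that occurs once, so the two matrices need not rank terms identically. A term present in every document scores 0 in the from-scratch version but keeps a nonzero weight in scikit-learn's.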


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Set up TfidfVectorizer. By default it L2-normalizes rows and smooths the IDF; we turn both off here for simplicity
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False) # Trying to match simple logic

sklearn_tfidf = vectorizer.fit_transform(corpus)
df_sklearn = pd.DataFrame(sklearn_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

print("Sklearn TF-IDF Matrix:")
print(df_sklearn)
