# Building TF-IDF from Scratch
In this notebook, we will implement **Term Frequency - Inverse Document Frequency (TF-IDF)** from scratch using Python.
This is a fundamental technique in Natural Language Processing (NLP) for converting text data into numerical vectors.

### Goals
1. Implement **TF (Term Frequency)**.
2. Implement **IDF (Inverse Document Frequency)**.
3. Combine them to create **TF-IDF**.
4. Compare our results with `sklearn`.


In [1]:
import pandas as pd
import math
import numpy as np

# Sample Corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great"
]
print("Corpus:", corpus)

Corpus: ['the cat sat on the mat', 'the dog sat on the log', 'cats and dogs are great']


## 1. Term Frequency (TF)

**Term Frequency** measures how frequently a term appears in a document.

### Formula
$$
TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}
$$

### Exercise 1
Complete the function `compute_tf` below.


In [2]:
def compute_tf(document):
    """
    Computes TF for a single document (string).
    Returns a dictionary: {term: tf_value}
    """
    # Split the document into words (tokens)
    words = document.split()
    total_words = len(words)

    # 1. Count the frequency of each word
    word_counts = {}
    for word in words:
        # Increment count for each occurrence of the word
        word_counts[word] = word_counts.get(word, 0) + 1

    # 2. Calculate TF for each word
    tf_dict = {}

    for word, count in word_counts.items():
        # TF Formula: (count of term) / (total terms in document)
        tf_dict[word] = count / total_words

    return tf_dict

# Test with the first document
# Assuming 'corpus' is defined in your environment
print("TF for doc 0:", compute_tf(corpus[0]))

TF for doc 0: {'the': 0.3333333333333333, 'cat': 0.16666666666666666, 'sat': 0.16666666666666666, 'on': 0.16666666666666666, 'mat': 0.16666666666666666}
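
A more compact way to write the same computation, sketched here with `collections.Counter` from the standard library. The name `tf_with_counter` is a hypothetical helper introduced only for this illustration. The sketch also checks a property that follows from the formula: the TF values of a document sum to 1 (up to floating-point rounding).

```python
from collections import Counter

def tf_with_counter(document):
    """Hypothetical compact variant of compute_tf, using whitespace tokenization."""
    tokens = document.split()
    counts = Counter(tokens)  # term -> number of occurrences in this document
    return {term: count / len(tokens) for term, count in counts.items()}

tf = tf_with_counter("the cat sat on the mat")
print("TF of 'the':", tf["the"])
print("Sum of all TF values:", sum(tf.values()))
```

An empty string yields an empty dictionary, because there are no terms to divide for.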


## 2. Inverse Document Frequency (IDF)

**IDF** measures how informative a term is across the whole corpus. TF treats every term as equally important, whereas IDF down-weights terms that appear in many documents (like "the", "is") and up-weights terms that appear in only a few.

### Formula
$$
IDF(t) = \log \left( \frac{N}{DF(t)} \right)
$$
Where:
* $N$ = Total number of documents.
* $DF(t)$ = Number of documents containing term $t$.

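*Note: Either `math.log10` or `math.log` (natural log) works. The base only rescales every IDF value by the same constant factor, so the relative weights are unchanged. The solution in this notebook uses `math.log`.*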
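
To illustrate why the choice of base does not matter much, here is a small sketch that computes the IDF in both bases for the two document frequencies that occur in the sample corpus (1 and 2, with $N = 3$). The helper name `idf_in_both_bases` is an assumption made for this example.

```python
import math

def idf_in_both_bases(n_docs, doc_freq):
    """Return (natural-log IDF, base-10 IDF) for a term with the given document frequency."""
    ratio = n_docs / doc_freq
    return math.log(ratio), math.log10(ratio)

for doc_freq in (1, 2):
    idf_ln, idf_log10 = idf_in_both_bases(3, doc_freq)
    # The last column is the same for every term: it equals ln(10)
    print(doc_freq, round(idf_ln, 4), round(idf_log10, 4), round(idf_ln / idf_log10, 4))
```

The ratio between the two versions is the same for every term, so switching bases multiplies the whole IDF table by one constant.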

### Exercise 2
Complete the function `compute_idf`.


In [3]:
import math

def compute_idf(corpus):
    """
    Computes IDF for the entire corpus.
    Returns a dictionary: {term: idf_value}
    """
    # Total number of documents in the corpus
    N = len(corpus)

    # Dictionary to store how many documents contain each word
    all_words_df = {}

    for doc in corpus:
        # 1. Use set() to get unique words in the current document
        # This ensures we count a word only once per document (Document Frequency)
        words = set(doc.split())

        for word in words:
            # Increment the count of documents that contain this word
            all_words_df[word] = all_words_df.get(word, 0) + 1

    # 2. Calculate IDF for each word
    idf_dict = {}
    for word, df_count in all_words_df.items():
        # Formula: IDF = log(Total Documents / Number of docs containing the word)
        # Using math.log (natural logarithm)
        idf_dict[word] = math.log(N / df_count)

    return idf_dict

# Test it
# Assuming 'corpus' is a list of strings
idf_result = compute_idf(corpus)
print("IDF Result:", idf_result)

IDF Result: {'mat': 1.0986122886681098, 'the': 0.4054651081081644, 'sat': 0.4054651081081644, 'on': 0.4054651081081644, 'cat': 1.0986122886681098, 'dog': 1.0986122886681098, 'log': 1.0986122886681098, 'and': 1.0986122886681098, 'are': 1.0986122886681098, 'cats': 1.0986122886681098, 'dogs': 1.0986122886681098, 'great': 1.0986122886681098}


## 3. TF-IDF

Now we multiply them together:
$$
TF\text{-}IDF = TF(t, d) \times IDF(t)
$$
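
As a quick check of the formula, here is the arithmetic for two terms of the first document, written $d_0$ here for convenience, using the TF values and natural-log IDF values printed above. "cat" appears once among 6 tokens and in 1 of 3 documents; "the" appears twice among 6 tokens and in 2 of 3 documents.

$$
TF\text{-}IDF(\text{cat}, d_0) = \frac{1}{6} \times \ln\left(\frac{3}{1}\right) \approx 0.1667 \times 1.0986 \approx 0.1831
$$

$$
TF\text{-}IDF(\text{the}, d_0) = \frac{2}{6} \times \ln\left(\frac{3}{2}\right) \approx 0.3333 \times 0.4055 \approx 0.1352
$$

Although "the" occurs twice as often as "cat" in this document, its lower IDF leaves it with the smaller weight.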

### Exercise 3
Create the full TF-IDF matrix for our corpus.


In [4]:
def compute_tfidf(corpus):
    # 1. First, compute IDF for the entire corpus (Global values)
    idf_dict = compute_idf(corpus)
    vectors = []

    for doc in corpus:
        # 2. Compute TF for the current document (Local values)
        tf_dict = compute_tf(doc)

        # Calculate TF-IDF for each word in the doc
        doc_tfidf = {}
        for word, tf_val in tf_dict.items():
            # TF-IDF = TF * IDF, with the IDF value looked up in the pre-computed idf_dict
            doc_tfidf[word] = tf_val * idf_dict[word]

        # Add the document's dictionary of weights to our list
        vectors.append(doc_tfidf)

    return vectors

# Run the pipeline
tfidf_vectors = compute_tfidf(corpus)

# Display as a DataFrame for better visibility
import pandas as pd
df = pd.DataFrame(tfidf_vectors)
df = df.fillna(0) # Fill NaN with 0 because a word might not exist in all documents
print("My TF-IDF Matrix:")
print(df)

My TF-IDF Matrix:
        the       cat       sat        on       mat       dog       log  \
0  0.135155  0.183102  0.067578  0.067578  0.183102  0.000000  0.000000   
1  0.135155  0.000000  0.067578  0.067578  0.000000  0.183102  0.183102   
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   

       cats       and      dogs       are     great  
0  0.000000  0.000000  0.000000  0.000000  0.000000  
1  0.000000  0.000000  0.000000  0.000000  0.000000  
2  0.219722  0.219722  0.219722  0.219722  0.219722  


## 4. Comparison with Scikit-Learn
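Let's see how our results compare with scikit-learn's `TfidfVectorizer`.
Note: scikit-learn defines TF-IDF slightly differently. It uses the raw term count as TF (not divided by document length) and adds 1 to every IDF value, so with `smooth_idf=False` it computes $IDF(t) = \ln(N / DF(t)) + 1$. By default it also smooths the IDF and L2-normalizes each document vector. The numbers therefore won't be identical to ours, although in both versions rare terms get a larger IDF than common ones.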


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Setup TfidfVectorizer
# norm=None: disables the default L2 normalization of each document vector
# smooth_idf=False: prevents adding 1 to the numerator and denominator inside the IDF log
# Even with these settings, sklearn uses raw counts as TF and adds 1 to each IDF,
# so the values below are still larger than our manual ones
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)

# Fit the model (learn the vocabulary and IDF) and transform the corpus (calculate TF-IDF)
sklearn_tfidf = vectorizer.fit_transform(corpus)

# Convert the sparse matrix to a dense array and create a DataFrame for better display
# get_feature_names_out() retrieves the unique words (columns) in order
df_sklearn = pd.DataFrame(
    sklearn_tfidf.toarray(),
    columns=vectorizer.get_feature_names_out()
)

print("Sklearn TF-IDF Matrix:")
print(df_sklearn)

Sklearn TF-IDF Matrix:
        and       are       cat      cats       dog      dogs     great  \
0  0.000000  0.000000  2.098612  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  2.098612  0.000000  0.000000   
2  2.098612  2.098612  0.000000  2.098612  0.000000  2.098612  2.098612   

        log       mat        on       sat      the  
0  0.000000  2.098612  1.405465  1.405465  2.81093  
1  2.098612  0.000000  1.405465  1.405465  2.81093  
2  0.000000  0.000000  0.000000  0.000000  0.00000  
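
The gap between the two matrices can be accounted for by hand. The sketch below recomputes the weights in plain Python using raw counts as TF and $\ln(N / DF(t)) + 1$ as IDF, which is the formula scikit-learn documents for `smooth_idf=False` with `norm=None`. The function name `sklearn_style_tfidf` is a hypothetical helper, and the sketch assumes whitespace tokenization, which happens to give the same tokens as scikit-learn's default tokenizer on this particular corpus.

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great",
]

def sklearn_style_tfidf(corpus):
    """Raw-count TF times (ln(N / DF) + 1), with no row normalization."""
    n_docs = len(corpus)
    tokenized = [doc.split() for doc in corpus]

    # Document frequency: number of documents containing each term
    doc_freq = {}
    for tokens in tokenized:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    vectors = []
    for tokens in tokenized:
        weights = {}
        for term in set(tokens):
            raw_count = tokens.count(term)                 # TF as a raw count
            idf = math.log(n_docs / doc_freq[term]) + 1    # unsmoothed IDF plus 1
            weights[term] = raw_count * idf
        vectors.append(weights)
    return vectors

vectors = sklearn_style_tfidf(corpus)
# These correspond to the 'cat', 'sat' and 'the' entries of row 0 in the sklearn output
print(round(vectors[0]["cat"], 6), round(vectors[0]["sat"], 6), round(vectors[0]["the"], 6))
```

Dividing each raw count by the document length and dropping the `+ 1` turns this back into the from-scratch definition used in sections 1 to 3.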
