
## Building TF-IDF from Scratch
In this notebook, we implement Term Frequency-Inverse Document Frequency (TF-IDF) from scratch in Python. TF-IDF is a fundamental technique in Natural Language Processing (NLP) for converting text into numerical vectors.


In [9]:
import pandas as pd
import math
import numpy as np


corpus = [
    "i love web development",
    "i love cats",
    "i hate dogs"
]

print("Corpus:", corpus)


Corpus: ['i love web development', 'i love cats', 'i hate dogs']


## 1. Term Frequency (TF)

In [8]:
def compute_tf(document):
    # Takes a single document (a string) and returns its term frequencies.

    # Split the document into lowercase tokens
    words = document.lower().split()
    total_words = len(words)  # total number of tokens in the document

    # Count how many times each word occurs
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1

    # Compute TF: occurrences of the word divided by the document length
    tf_dict = {}
    for word, count in word_counts.items():
        tf_dict[word] = count / total_words

    return tf_dict

print("TF for doc 0:", compute_tf(corpus[0]))


TF for doc 0: {'i': 0.25, 'love': 0.25, 'web': 0.25, 'development': 0.25}
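
Every word in document 0 occurs exactly once, so all four TF values come out equal. The sketch below shows the same calculation on a document with a repeated word, using `collections.Counter` for the counting step. The name `compute_tf_counter` and the sample sentence are illustrative assumptions, not part of the notebook above.

```python
from collections import Counter

def compute_tf_counter(document):
    # Same definition as compute_tf: occurrences / total tokens in the document.
    words = document.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample document with a repeated word.
sample = "cats love cats"
tf_sample = compute_tf_counter(sample)
print(tf_sample)
```

Here "cats" accounts for two of the three tokens and "love" for one, and the TF values of a document always sum to 1.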


## 2. Inverse Document Frequency (IDF)

In [7]:
def compute_idf(corpus):

    # Number of documents in the corpus
    N = len(corpus)
    # Document frequency: how many documents contain each word
    all_words_df = {}

    for doc in corpus:
        words = set(doc.lower().split())  # unique words, so each word counts once per document

        for word in words:
            all_words_df[word] = all_words_df.get(word, 0) + 1

    idf_dict = {}

    # Compute IDF = ln(N / document frequency)
    for word, df_count in all_words_df.items():
        idf_dict[word] = math.log(N / df_count)

    return idf_dict

idf_result = compute_idf(corpus)
print("IDF Result:", idf_result)


IDF Result: {'web': 1.0986122886681098, 'development': 1.0986122886681098, 'love': 0.4054651081081644, 'i': 0.0, 'cats': 1.0986122886681098, 'hate': 1.0986122886681098, 'dogs': 1.0986122886681098}
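
The three distinct values in this output follow directly from the document frequencies: "i" appears in all 3 documents, "love" in 2, and every other word in 1. A minimal check of that arithmetic, with the document frequencies typed in by hand from the corpus (`doc_freq` and `idf_check` are illustrative names):

```python
import math

N = 3  # number of documents in the corpus
# Document frequencies read off the corpus by hand.
doc_freq = {"i": 3, "love": 2, "web": 1}

# IDF = ln(N / df), the same formula compute_idf uses.
idf_check = {word: math.log(N / df) for word, df in doc_freq.items()}
print(idf_check)
```

A word that occurs in every document gets an IDF of ln(1) = 0, so it contributes nothing to the final TF-IDF score no matter how often it appears.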


## 3. TF-IDF
Now we multiply each word's TF by its IDF to get its TF-IDF score:

In [6]:
def compute_tfidf(corpus):
    # Compute the IDF of every word in the corpus
    idf_dict = compute_idf(corpus)

    # List that stores one TF-IDF dictionary per document
    numbers = []

    # Loop over every document in the corpus
    for doc in corpus:

        tf_dict = compute_tf(doc)

        doc_tfidf = {}

        for word, tf_val in tf_dict.items():
            # TF-IDF = TF * IDF
            doc_tfidf[word] = tf_val * idf_dict[word]

        # Add this document's dictionary to the final list
        numbers.append(doc_tfidf)

    return numbers


# Compute TF-IDF for every document in the corpus
tfidf_numbers = compute_tfidf(corpus)

# Convert the result to a DataFrame (one row per document, one column per word)
df = pd.DataFrame(tfidf_numbers)

# Words missing from a document show up as NaN; fill them with zero
df = df.fillna(0)

# Print the final TF-IDF matrix
print("TF-IDF:")
print("\n")
print(df)


TF-IDF:


     i      love       web  development      cats      hate      dogs
0  0.0  0.101366  0.274653     0.274653  0.000000  0.000000  0.000000
1  0.0  0.135155  0.000000     0.000000  0.366204  0.000000  0.000000
2  0.0  0.000000  0.000000     0.000000  0.000000  0.366204  0.366204
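
The DataFrame above is what turns the per-document dictionaries into fixed-length vectors: each column is one vocabulary word, and a word missing from a document gets 0. The following sketch builds the same vectors without pandas, assuming an alphabetically sorted vocabulary as the column order (so the columns come out in a different order than in the table above). The names `tokenized`, `vocabulary`, and `vectors` are illustrative.

```python
import math

corpus = ["i love web development", "i love cats", "i hate dogs"]

tokenized = [doc.lower().split() for doc in corpus]
n_docs = len(tokenized)

# Fixed column order: every distinct word in the corpus, sorted alphabetically.
vocabulary = sorted({word for words in tokenized for word in words})

# Document frequency: number of documents that contain each word.
doc_freq = {word: sum(word in words for words in tokenized) for word in vocabulary}

vectors = []
for words in tokenized:
    row = []
    for word in vocabulary:
        tf = words.count(word) / len(words)       # 0.0 if the word is absent
        idf = math.log(n_docs / doc_freq[word])
        row.append(tf * idf)
    vectors.append(row)

print(vocabulary)
for row in vectors:
    print([round(value, 6) for value in row])
```

Each row has one entry per vocabulary word, so all documents end up with vectors of the same length regardless of how many words they contain.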


## 4. Comparison with Scikit-Learn

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer applies L2 normalization and IDF smoothing by default;
# both are turned off here to get closer to the simple from-scratch logic.
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)

sklearn_tfidf = vectorizer.fit_transform(corpus)
df_sklearn = pd.DataFrame(sklearn_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

print("Sklearn TF-IDF Matrix:")
print(df_sklearn)


Sklearn TF-IDF Matrix:
       cats  development      dogs      hate      love       web
0  0.000000     2.098612  0.000000  0.000000  1.405465  2.098612
1  2.098612     0.000000  0.000000  0.000000  1.405465  0.000000
2  0.000000     0.000000  2.098612  2.098612  0.000000  0.000000
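
These numbers still differ from the from-scratch matrix, for three reasons. With these settings `TfidfVectorizer` uses the raw count as TF rather than count divided by document length, it computes IDF as ln(N / df) + 1, and its default token pattern keeps only tokens of two or more characters, which drops "i". The pure-Python sketch below applies those three changes to reproduce the values in the scikit-learn output; the length filter is a simplification of the default tokenizer that happens to be equivalent on this corpus, and `matched` is an illustrative name.

```python
import math

corpus = ["i love web development", "i love cats", "i hate dogs"]

# Keep only tokens with at least two characters (drops "i").
tokenized = [[w for w in doc.lower().split() if len(w) >= 2] for doc in corpus]
n_docs = len(tokenized)

# Document frequency over the filtered tokens.
doc_freq = {}
for words in tokenized:
    for word in set(words):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# TF = raw count, IDF = ln(N / df) + 1.
matched = []
for words in tokenized:
    matched.append({
        word: words.count(word) * (math.log(n_docs / doc_freq[word]) + 1)
        for word in set(words)
    })

for row in matched:
    print({word: round(value, 6) for word, value in sorted(row.items())})
```

Because of the added 1, a word that appears in every document would still get a nonzero weight under this variant, unlike in the from-scratch version where its IDF is 0.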
