# TF-IDF (Term Frequency-Inverse Document Frequency)

The mathematical formula for **TF-IDF (Term Frequency-Inverse Document Frequency)** is:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

Where:

### Term Frequency (TF):
$$
\text{TF}(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}}
$$

- $f_{t, d}$: The raw count of term $t$ in document $d$.
- $\sum_{t' \in d} f_{t', d}$: The total number of term occurrences (tokens) in document $d$.
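
A minimal Python sketch of this definition, assuming lowercase whitespace tokenization (the formula itself does not prescribe a tokenizer) and a hypothetical helper name `term_frequency`:

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Relative frequency of each token in a whitespace-tokenized document."""
    tokens = document.lower().split()
    counts = Counter(tokens)   # f_{t,d} for every term t in the document
    total = len(tokens)        # equals the sum of all counts
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("kiwi apple kiwi")
print(tf)
```

Here "kiwi" makes up two of the three tokens, so its TF is 2/3, and the values for one document always sum to 1.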

### Inverse Document Frequency (IDF):
$$
\text{IDF}(t, D) = \log \left( \frac{|D|}{1 + |\{d \in D : t \in d\}|} \right)
$$

- $|D|$: The total number of documents in the collection $D$.
- $|\{d \in D : t \in d\}|$: The number of documents in which the term $t$ appears.
- $1$: Added to the denominator to avoid division by zero when $t$ does not appear in any document. With this smoothing, a term that appears in every document gets a slightly negative IDF.
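
The IDF formula as written can be sketched the same way. The helper name `inverse_document_frequency` is hypothetical, the tokenization (lowercase, split on whitespace) and the natural logarithm are assumptions, and the four short documents are the ones used in the code cell further down:

```python
import math


def inverse_document_frequency(corpus: list[str]) -> dict[str, float]:
    """IDF(t, D) = log(|D| / (1 + df_t)) for every term seen in the corpus."""
    n_docs = len(corpus)
    doc_freq: dict[str, int] = {}
    for doc in corpus:
        # Use a set so that each document counts at most once per term.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}


idf = inverse_document_frequency(
    ["apple pear", "banana Kiwi", "dragon fruit apple", "kiwi apple"]
)
print(idf)
```

In this corpus "apple" occurs in three of the four documents, so its IDF is $\log(4/4) = 0$, while a term that occurs in a single document, such as "pear", gets $\log(4/2) = \log 2$.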

---

### Explanation:
1. **$t$ (term)**:
   - A single word or token for which the TF-IDF score is being calculated.
2. **$d$ (document)**:
   - The specific document from the corpus $D$ in which the term $t$ is being scored.
3. **$D$ (corpus)**:
   - The entire collection of documents.
4. **$f_{t, d}$ (raw term count)**:
   - The number of times term $t$ occurs in document $d$.
5. **$\log$ (logarithm)**:
   - Compresses the document-count ratio so that rare terms get higher, but not disproportionately large, weights; terms that appear in most documents get an IDF close to zero.
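
As a worked example, assume a corpus of four documents ("apple pear", "banana kiwi", "dragon fruit apple", "kiwi apple"), with $d_1$ denoting "apple pear", and assume the natural logarithm (the formula does not fix the base). For the two terms of $d_1$:

$$
\text{TF}(\text{pear}, d_1) = \frac{1}{2}, \quad \text{IDF}(\text{pear}, D) = \log \left( \frac{4}{1 + 1} \right) \approx 0.693, \quad \text{TF-IDF}(\text{pear}, d_1, D) \approx 0.347
$$

$$
\text{TF}(\text{apple}, d_1) = \frac{1}{2}, \quad \text{IDF}(\text{apple}, D) = \log \left( \frac{4}{1 + 3} \right) = 0, \quad \text{TF-IDF}(\text{apple}, d_1, D) = 0
$$

Both terms are equally frequent within $d_1$, but "apple" appears in three of the four documents and therefore carries no weight, while "pear" appears only in $d_1$.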



In [31]:
import numpy as np
import pandas as pd

class TfidfVectorizer:
    """Minimal TF-IDF vectorizer using lowercase whitespace tokenization."""

    def __init__(self):
        self.vocabulary = {}
        self.document_frequency = {}
        
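    def fit(self, corpus):
        # Assign each distinct lowercase token a column index. Iterating over
        # a set means the column order can differ between runs.
        for doc in corpus:
            for word in set(doc.lower().split()):
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)

        # Document frequency: number of documents that contain each term.
        self.document_frequency = {term: 0 for term in self.vocabulary}
        for doc in corpus:
            unique_terms = set(doc.lower().split())
            for word in unique_terms:
                self.document_frequency[word] += 1

        # IDF calculation. Note that this computes log(N / df + 1), a smoothed
        # variant, rather than log(N / (1 + df)) from the formula above.
        # df is at least 1 for every vocabulary term, so N / df is safe.
        self.inverse_document_frequency = {}
        N = len(corpus)
        for term, df in self.document_frequency.items():
            self.inverse_document_frequency[term] = float(np.log(N / df + 1))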

    def transform(self, corpus):
        tf_idf_matrix = np.zeros((len(corpus), len(self.vocabulary)))

        for i, doc in enumerate(corpus):
            term_count = {}
            for term in doc.lower().split():
                term_count[term] = term_count.get(term, 0) + 1

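            for term, count in term_count.items():
                # Terms that were not seen during fit() are skipped.
                if term in self.vocabulary:
                    # TF here is the raw count, not normalized by document length.
                    tf = count
                    idf = self.inverse_document_frequency[term]
                    tf_idf_matrix[i, self.vocabulary[term]] = tf * idf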

        return tf_idf_matrix 
    
    def fit_transform(self, corpus):
        self.fit(corpus)
        result = self.transform(corpus)
        return result


corpus = [
    "apple pear",
    "banana Kiwi",
    "dragon fruit apple",
    "kiwi apple"
]

def format_matrix(vocab, matrix):
    # Label the columns with the vocabulary terms; dict insertion order
    # matches the column indices assigned in fit().
    if len(vocab) == matrix.shape[1]:
        return pd.DataFrame(data=matrix, columns=list(vocab))
    raise ValueError("Vocabulary and result matrix do not match")


tf_idf = TfidfVectorizer()
# tf_idf.fit(corpus)

result = tf_idf.fit_transform(corpus)

print(tf_idf.vocabulary)
print(tf_idf.document_frequency)
print(tf_idf.inverse_document_frequency)
print(f"Result:\n{format_matrix(tf_idf.vocabulary, result)}")

{'pear': 0, 'apple': 1, 'kiwi': 2, 'banana': 3, 'fruit': 4, 'dragon': 5}
{'pear': 1, 'apple': 3, 'kiwi': 2, 'banana': 1, 'fruit': 1, 'dragon': 1}
{'pear': 1.6094379124341003, 'apple': 0.8472978603872034, 'kiwi': 1.0986122886681098, 'banana': 1.6094379124341003, 'fruit': 1.6094379124341003, 'dragon': 1.6094379124341003}
Result:
       pear     apple      kiwi    banana     fruit    dragon
0  1.609438  0.847298  0.000000  0.000000  0.000000  0.000000
1  0.000000  0.000000  1.098612  1.609438  0.000000  0.000000
2  0.000000  0.847298  0.000000  0.000000  1.609438  1.609438
3  0.000000  0.847298  1.098612  0.000000  0.000000  0.000000
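
The class above uses the raw count as TF and `np.log(N/df+1)` as IDF, so its scores differ from the formulas at the top of this document. For comparison, here is a minimal sketch that follows those formulas as written (length-normalized TF, and IDF with the `1 +` in the denominator). The function name `tf_idf_by_formula` is hypothetical, and the tokenization and natural logarithm are the same assumptions the class makes:

```python
import math
from collections import Counter


def tf_idf_by_formula(corpus):
    """Return one {term: score} dict per document, following the stated formulas."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document it occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return scores


corpus = ["apple pear", "banana Kiwi", "dragon fruit apple", "kiwi apple"]
for doc, doc_scores in zip(corpus, tf_idf_by_formula(corpus)):
    print(doc, {term: round(score, 4) for term, score in doc_scores.items()})
```

Under these formulas "apple" scores exactly 0 in every document that contains it, because it appears in three of the four documents, whereas the class above gives it a positive weight of about 0.847.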
