# TF-IDF (Term Frequency-Inverse Document Frequency)

## Goal of TF-IDF
### Characteristics of terms with a high TF-IDF score:
- High frequency within a document (TF)
- Present in only a few documents (IDF)

### Why is this useful?
This is useful for search because it helps identify the words that characterize a document. For example, in a document about the Python programming language, you would expect the words "python" and "programming" to appear frequently. The TF-IDF score highlights such words while down-weighting common words like "the", "is", and "and" that appear in most documents.

## Term Frequency
Term frequency measures how often a word appears in a document. Words that appear more frequently in a document are likely to be important for that document.

## Inverse Document Frequency
Inverse document frequency measures how rare a word is across the entire corpus. If a word appears in many documents, it is less helpful in distinguishing one document from another (e.g., "the"). If a word appears in just a few documents, it is more distinctive.

## Combining TF and IDF

$$ \text{TF-IDF} = \text{TF} \times \text{IDF} $$

High TF-IDF: The word is frequent in the document but rare in the corpus (important and unique).

Low TF-IDF: The word is either common across the corpus or not frequent in the document (less important).
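
As a rough illustration of the two cases above, the sketch below scores two words in a made-up three-document corpus. It assumes raw counts for TF and the unsmoothed $\log(N / \text{df})$ for IDF, which is only one of several common choices; the corpus and the helper name `toy_tf_idf` are hypothetical.

```python
import math

# Hypothetical toy corpus: "the" appears in every document, "python" in only one.
toy_corpus = [
    "the python language the python interpreter",
    "the weather today",
    "the train schedule",
]

def toy_tf_idf(term: str, doc: str, corpus: list[str]) -> float:
    """Raw-count TF times unsmoothed log(N / df) IDF (an assumed variant)."""
    tf = doc.split().count(term)
    df = sum(1 for d in corpus if term in d.split())
    if df == 0:
        return 0.0  # the term never occurs, so there is nothing to score
    return tf * math.log(len(corpus) / df)

# Both words occur twice in the first document, but only "python" is rare.
print("python:", toy_tf_idf("python", toy_corpus[0], toy_corpus))
print("the:   ", toy_tf_idf("the", toy_corpus[0], toy_corpus))
```

Under these assumptions "the" scores exactly zero because it occurs in every document, while "python" keeps a positive score.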



## Mathematical formula

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

Where:

### Term Frequency (TF):
$$
\text{TF}(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}}
$$

- $f_{t, d}$: The frequency of term $t$ in document $d$.
- $\sum_{t' \in d} f_{t', d}$: The total number of terms in document $d$.
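
The TF formula can be sketched in a few lines of Python. This is a minimal illustration that assumes lowercased whitespace tokenization; the function name `term_frequency` is hypothetical.

```python
from collections import Counter

def term_frequency(term: str, document: str) -> float:
    """Count of `term` divided by the total number of tokens in `document`."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # an empty document has no terms to count
    return Counter(tokens)[term.lower()] / len(tokens)

# "apple" is 2 of the 3 tokens, "banana" is 1 of the 3 tokens.
print(term_frequency("apple", "Apple Apple Banana"))
print(term_frequency("banana", "Apple Apple Banana"))
```

Dividing by the document length keeps long documents from scoring higher simply because they contain more words.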

### Inverse Document Frequency (IDF):
$$
\text{IDF}(t, D) = \log \left( \frac{|D|}{1 + |\{d \in D : t \in d\}|} \right)
$$

- $|D|$: The total number of documents in the collection $D$.
- $|\{d \in D : t \in d\}|$: The number of documents in which the term $t$ appears.
- $1$: Added to avoid division by zero in case $t$ does not appear in any document.
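
A minimal sketch of this IDF formula follows. It assumes the natural logarithm (the formula does not fix a base) and lowercased whitespace tokenization; the function name and the three-document corpus are hypothetical.

```python
import math

def inverse_document_frequency(term: str, corpus: list[str]) -> float:
    """log(|D| / (1 + df)), where df is the number of documents containing `term`."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term.lower() in set(doc.lower().split()))
    return math.log(n_docs / (1 + df))

docs = ["apple banana", "apple mango", "apple cherry"]
print(inverse_document_frequency("banana", docs))  # term in 1 of 3 documents
print(inverse_document_frequency("apple", docs))   # term in all 3 documents
print(inverse_document_frequency("kiwi", docs))    # term in no document
```

One consequence of placing the $1$ in the denominator is that a term present in every document gets a slightly negative IDF, since $|D| / (1 + |D|) < 1$. Other smoothing variants avoid this by adding the $1$ elsewhere.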

---

### Explanation:
1. **Where $t$ is the term**:
   - A single word or token for which the TF-IDF score is being calculated.
2. **Where $d$ is the document**:
   - A specific document from the corpus $D$ where the term $t$ is being considered.
3. **Where $D$ is the corpus**:
   - The entire collection of documents.
4. **Where $f_{t, d}$ is the term frequency**:
   - The raw count of term $t$ in document $d$.
5. **Where $\log$ is the logarithm function**:
   - Dampens the ratio of total documents to matching documents, so that the IDF of a very rare term grows slowly instead of in direct proportion to the corpus size.
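
A worked example may help tie the symbols together. Assume a hypothetical document of three tokens in which the term occurs twice, a corpus of 12 documents of which 6 contain the term, and the natural logarithm (the formulas do not fix a base):

$$
\text{TF}(t, d) = \frac{2}{3} \approx 0.667, \qquad
\text{IDF}(t, D) = \ln \left( \frac{12}{1 + 6} \right) \approx 0.539
$$

$$
\text{TF-IDF}(t, d, D) \approx 0.667 \times 0.539 \approx 0.359
$$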



In [7]:
import numpy as np
import pandas as pd

class TfidfVectorizer:

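    def __init__(self):
        self.vocabulary = {}                  # term -> column index
        self.document_frequency = {}          # term -> number of documents containing it
        self.inverse_document_frequency = {}  # term -> IDF weight, filled in by fit()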
        
    def fit(self, corpus): 
        for doc in corpus:
            for word in set(doc.lower().split()):
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)

        # doc frequency
        self.document_frequency = {term: 0 for term in self.vocabulary}
        for doc in corpus:
            unique_terms = set(doc.lower().split())
            for word in unique_terms:
                self.document_frequency[word] += 1

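        # IDF calculation
        # Note: this computes log(N / df + 1), a smoothed variant that is always positive.
        # It is not the log(N / (1 + df)) form given in the formula section.
        self.inverse_document_frequency = {}
        N = len(corpus)
        for term, df in self.document_frequency.items():
            self.inverse_document_frequency[term] = float(np.log(N / df + 1))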

    def transform(self, corpus):
        tf_idf_matrix = np.zeros((len(corpus), len(self.vocabulary)))

        for i, doc in enumerate(corpus):
            term_count = {}
            for term in doc.lower().split():
                term_count[term] = term_count.get(term, 0) + 1

            for term, count in term_count.items():
                if term in self.vocabulary:
                    tf = count  # raw count; the TF formula above would also divide by the document length
                    idf = self.inverse_document_frequency[term]
                    tf_idf_matrix[i][self.vocabulary[term]] = tf * idf

        return tf_idf_matrix 
    
    def fit_transform(self, corpus):
        self.fit(corpus)
        result = self.transform(corpus)
        return result

corpus = [
    "Apple Apple Banana",
    "Banana Mango Banana",
    "Cherry Cherry Cherry",
    "Grapes Grapes Berries Grapes",
    "Apple Banana Mango",
    "Blueberries Strawberries Apple",
    "Apple Banana Mango",
    "Grapes Grapes Grapes",
    "Blueberries Apple Strawberries",
    "Apple Banana Apple",
    "Cherry Cherry Mango Cherry",
    "Blueberries Strawberries Cherry",
]

def format_matrix(vocab, matrix):
    if len(vocab) == len(matrix[0]):
        terms = list(vocab)
        return pd.DataFrame(
            data=matrix,
            columns=terms
        )
    else:
        raise ValueError("Vocabulary and Result matrix do not match")


tf_idf = TfidfVectorizer()
# tf_idf.fit(corpus)

result = tf_idf.fit_transform(corpus)

print(f"Vocab: {tf_idf.vocabulary}")
print(f"Document Frequency: {tf_idf.document_frequency}")
print(f"IDF: {tf_idf.inverse_document_frequency}")
print("Result:")
format_matrix(tf_idf.vocabulary, result)

Vocab: {'banana': 0, 'apple': 1, 'mango': 2, 'cherry': 3, 'grapes': 4, 'berries': 5, 'blueberries': 6, 'strawberries': 7}
Document Frequency: {'banana': 5, 'apple': 6, 'mango': 4, 'cherry': 3, 'grapes': 2, 'berries': 1, 'blueberries': 3, 'strawberries': 3}
IDF: {'banana': 1.2237754316221157, 'apple': 1.0986122886681098, 'mango': 1.3862943611198906, 'cherry': 1.6094379124341003, 'grapes': 1.9459101490553132, 'berries': 2.5649493574615367, 'blueberries': 1.6094379124341003, 'strawberries': 1.6094379124341003}
Result:


|   | banana | apple | mango | cherry | grapes | berries | blueberries | strawberries |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.223775 | 2.197225 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2.447551 | 0.0 | 1.386294 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 4.828314 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 5.83773 | 2.564949 | 0.0 | 0.0 |
| 4 | 1.223775 | 1.098612 | 1.386294 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.609438 | 1.609438 |
| 6 | 1.223775 | 1.098612 | 1.386294 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 5.83773 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.609438 | 1.609438 |
| 9 | 1.223775 | 2.197225 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [8]:
import plotly.graph_objects as go

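# Sort terms by document frequency so the connecting line runs left to right
terms = sorted(tf_idf.document_frequency, key=tf_idf.document_frequency.get)
x_values = [tf_idf.document_frequency[term] for term in terms]
y_values = [tf_idf.inverse_document_frequency[term] for term in terms]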

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=x_values,
        y=y_values,
        mode="markers+lines"
    )
)

fig.update_layout(
    title="DF vs IDF",
    xaxis_title="Document Frequency",
    yaxis_title="Inverse Document Frequency"
)

fig.show()

The chart above shows how the IDF score changes with the number of documents in which a term appears. As the document frequency increases, the IDF score decreases, because a term that appears in more documents is less distinctive.
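
The same trend can be checked numerically. The sketch below tabulates two IDF variants side by side for a 12-document corpus: $\log(N / \text{df} + 1)$, which is what the `TfidfVectorizer` class above computes, and $\log(N / (1 + \text{df}))$ from the formula section. The function names are hypothetical and the natural logarithm is assumed.

```python
import math

N = 12  # corpus size, assumed to match the 12-document example corpus

def idf_notebook(df: int, n_docs: int = N) -> float:
    """Variant computed by the class above: log(N / df + 1)."""
    return math.log(n_docs / df + 1)

def idf_formula(df: int, n_docs: int = N) -> float:
    """Variant from the formula section: log(N / (1 + df))."""
    return math.log(n_docs / (1 + df))

for df in range(1, N + 1):
    print(f"df={df:2d}  log(N/df + 1)={idf_notebook(df):.3f}  log(N/(1+df))={idf_formula(df):.3f}")
```

Both variants fall as the document frequency rises, but only the second one can drop below zero, which happens when a term appears in every document.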

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_docs(query: str, corpus: list[str], result, tf_idf, top_k: int = 3):
    # Project the query into the same TF-IDF space as the fitted corpus
    query_vector = tf_idf.transform([query])
    # Cosine similarity between the query and every document row
    similarity_score = cosine_similarity(query_vector, result)
    # Document indices sorted from most to least similar
    ranked_indices = similarity_score.argsort()[0][::-1]
    retrieved_docs = [
        {"doc": corpus[i], "score": round(float(similarity_score[0][i]), 3)}
        for i in ranked_indices[:top_k]
    ]
    return retrieved_docs

queries = [
    "Blueberries Strawberries",
    "grapes",
    "cherry",
    "banana mango"
]
query = queries[3]
print(f"Query: {query}")
docs = retrieve_docs(query, corpus, result, tf_idf, top_k=5)
docs

Query: banana mango


[{'doc': 'Banana Mango Banana', 'score': 0.945},
 {'doc': 'Apple Banana Mango', 'score': 0.86},
 {'doc': 'Apple Banana Mango', 'score': 0.86},
 {'doc': 'Apple Banana Apple', 'score': 0.322},
 {'doc': 'Apple Apple Banana', 'score': 0.322}]
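
To see where the top score comes from, the cosine similarity can be recomputed by hand. This is a minimal sketch using only the standard library; the helper name `cosine` is hypothetical, and the vectors keep just the banana and mango columns, copied from the printed results above, because every other column is zero for both the query and the document.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction; treat it as no match
    return dot / (norm_u * norm_v)

# (banana, mango) weights
query_vec = [1.223775, 1.386294]  # query "banana mango": one occurrence of each term
doc_vec = [2.447551, 1.386294]    # document "Banana Mango Banana"

print(round(cosine(query_vec, doc_vec), 3))
```

The result agrees with the 0.945 reported for "Banana Mango Banana". The score is below 1 because the document weights banana twice as heavily as the query does, so the two vectors point in slightly different directions.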
