# TF-IDF

This notebook contains an explanation from ChatGPT on TF-IDF, saved here to experiment with later.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a term (a word or a phrase) is to one document relative to a collection of documents. A term scores high when it is frequent in that document but rare in the rest of the collection. It is commonly used in information retrieval and natural language processing tasks.

TF-IDF is calculated by multiplying the term frequency (TF) and the inverse document frequency (IDF). Let's take a closer look at each of these terms:

Term frequency (TF): The term frequency is a measure of how often a term appears in a document. It is calculated as the number of times a term appears in a document divided by the total number of terms in the document. For example, if the term "cat" appears 3 times in a document with a total of 100 terms, the term frequency for "cat" in that document is 0.03.
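The arithmetic in that example can be checked with a short sketch. The helper name `term_frequency` and the 100-token toy document are assumptions made for illustration; the document is assumed to be already split into tokens.

```python
from collections import Counter

def term_frequency(term, tokens):
    """Return (occurrences of term) / (total tokens) for a tokenized document."""
    if not tokens:
        # An empty document contains no terms at all
        return 0.0
    return Counter(tokens)[term] / len(tokens)

# Hypothetical document: 100 tokens, 3 of which are "cat"
tokens = ["cat"] * 3 + ["filler"] * 97
print(term_frequency("cat", tokens))  # 0.03
```

Counting over a token list rather than the raw string keeps "cat" from matching inside longer words such as "category".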

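Inverse document frequency (IDF): The inverse document frequency is a measure of how rare a term is across a set of documents. It is calculated as the logarithm of the total number of documents divided by the number of documents that contain the term. For example, if the term "cat" appears in 50 out of 1000 documents, the inverse document frequency for "cat" is log10(1000/50) = log10(20) ≈ 1.3 with a base-10 logarithm. The choice of base only rescales the scores by a constant factor; the Python code in this notebook uses the natural logarithm (math.log), which gives ln(20) ≈ 3.0 for the same example.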
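A minimal sketch of the same IDF example, computed with both a base-10 and a natural logarithm. The variable names are illustrative only.

```python
import math

n_documents = 1000   # total number of documents in the collection
n_containing = 50    # number of documents that contain "cat"

ratio = n_documents / n_containing   # 20.0
idf_base10 = math.log10(ratio)       # base-10 logarithm
idf_natural = math.log(ratio)        # natural logarithm (what math.log computes)

print(round(idf_base10, 2), round(idf_natural, 2))  # 1.3 3.0
```

Either base ranks terms in the same order, so what matters is using the same base throughout a calculation.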

Here's an example of how to calculate the TF-IDF score for a term in Python:

In [7]:
import math

def _tf(term, document):
    # Split on whitespace so that only whole words are counted;
    # str.count() would also match substrings ("cat" inside "category")
    tokens = document.split()

    # An empty document contains no terms
    if not tokens:
        return 0.0

    # Term frequency: occurrences of the term / total number of terms
    return tokens.count(term) / len(tokens)

def _idf(term, documents):
    # Count the number of documents that contain the term as a whole word
    document_count = sum(1 for document in documents if term in document.split())

    # A term found in no document would cause a division by zero below;
    # returning 0.0 here is a design choice, not a standard definition
    if document_count == 0:
        return 0.0

    # Inverse document frequency, using the natural logarithm
    return math.log(len(documents) / document_count)

def tf_idf(term, document, documents):
    # Calculate the term frequency
    tf = _tf(term, document)

    # Calculate the inverse document frequency
    idf = _idf(term, documents)

    # The TF-IDF score is the product of the two
    return tf * idf

# Calculate the TF-IDF score for the term "cat" in the document "The cat sat on the mat"
document = "The cat sat on the mat"
documents = ["The cat sat on the mat", "The dog chased the cat", "The dog chased the car"]
term = "cat"
print(_tf(term=term, document=document))
print(_idf(term=term, documents=documents))
tf_idf_score = tf_idf(term, document, documents)
print(tf_idf_score)

0.16666666666666666
0.4054651081081644
0.06757751801802739


In [8]:
tf_idf_score = tf_idf(term='The', document=document, documents=documents)
print(tf_idf_score)

0.0


In [5]:
math.log(1)

0.0
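The two cells above show why "The" scores 0: it appears in all three documents, so its IDF is log(3/3) = log(1) = 0. One commonly used smoothed form of IDF adds 1 to both counts and 1 to the result, which keeps such terms above zero and avoids a division by zero for terms that appear in no document. The sketch below is an illustration of that variant, not part of the original explanation; the function name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1, with df counted over whitespace tokens."""
    n_docs = len(documents)
    # Number of documents that contain the term as a whole word
    doc_freq = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

documents = ["The cat sat on the mat", "The dog chased the cat", "The dog chased the car"]
print(smoothed_idf("The", documents))   # 1.0: present in every document, yet nonzero
print(smoothed_idf("bird", documents))  # present in no document, no ZeroDivisionError
```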

This code defines three functions: _tf(), which calculates the term frequency; _idf(), which calculates the inverse document frequency; and tf_idf(), which multiplies the two to give the TF-IDF score. You can use these functions to calculate the TF-IDF score of any term in a document, given the set of documents it is compared against.
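To score every term at once rather than one term per call, the same definitions can be applied across the whole collection. This is a minimal sketch under stated assumptions: the function name `tf_idf_table` is hypothetical, and tokens are lowercased and split on whitespace, so "The" and "the" count as one term.

```python
import math
from collections import Counter

def tf_idf_table(documents):
    """Map each document index to a {term: tf-idf score} dictionary."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    table = {}
    for i, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        # tf = count / document length, idf = ln(N / document frequency)
        table[i] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return table

documents = ["The cat sat on the mat", "The dog chased the cat", "The dog chased the car"]
scores = tf_idf_table(documents)

# Terms of the first document, highest score first
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.4f}")
```

In the first document, the words unique to it ("sat", "on", "mat") rank above "cat", which also appears in the second document, and "the" scores 0 because it appears in all three.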