Text Mining : TF-IDF and Cosine Similarity from Scratch

Table of Contents:

1. Term Frequency (TF)
2. Inverse Document Frequency (IDF)
3. TF * IDF
4. Vector Space Models and Representation – Cosine Similarity

***Any feedback or feature requests are welcome!***


Imagine that you are searching the documents below with the following query: **life learning**

* **Document 1** : I want to start learning to charge something in life.
* **Document 2** : reading something about life no one else knows
* **Document 3** : Never stop learning

This is a free-text query: the terms are typed freeform into the search interface, without any connecting search operators.

Let us go over each step in detail to see how it all works.

# 1. Term Frequency (TF)
Term Frequency, also known as TF, measures the number of times a term (word) occurs in a document. The code below prints the terms of each document together with their frequencies.


In [None]:
# Import necessary libraries
import math
import pandas as pd
import numpy as np

In [None]:
#documents
doc1 = "I want to start learning to charge something in life"
doc2 = "reading something about life no one else knows"
doc3 = "Never stop learning"
#query string
query = "life learning"

**NOTE :** Text preprocessing steps are skipped, as the objective of this kernel is to explain and develop TF-IDF and cosine similarity from scratch.

In [None]:
# Term frequency: word occurrences in a document
def compute_tf(docs_list):
    # Iterate over each document
    for doc in docs_list:
        # Split the document into a list of words
        doc1_lst = doc.split(" ")
        # Create a dictionary with each unique word as a key and its count initialised to 0
        wordDict_1= dict.fromkeys(set(doc1_lst), 0)
        # Count the occurrences of each word in document
        for token in doc1_lst:
            # Increment count for each word occurrence
            wordDict_1[token] +=  1
        # Create a DataFrame from the dictionary of words and their frequencies
        df = pd.DataFrame([wordDict_1])
        # Insert a new column at the first position
        idx = 0
        # This will label the rows as 'Term Frequency'
        new_col = ["Term Frequency"]
        df.insert(loc=idx, column='Document', value=new_col)
        print(df)

compute_tf([doc1, doc2, doc3])

* In reality, documents differ in size. In a large document, term frequencies will be much higher than in a small one, so we need to normalize the frequencies by document size.
* A simple trick is to divide the term frequency by the total number of terms in the document.
* For example, in Document 1 the term to occurs two times. The total number of terms in the document is 10. Hence the normalized term frequency is 2 / 10 = 0.2.
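
The normalization described in the bullets above can be checked by hand. The following is a minimal sketch, assuming plain whitespace tokenization; the helper name `normalized_tf` is introduced here purely for illustration and is not part of the notebook's own code.

```python
from collections import Counter

def normalized_tf(document):
    # Hypothetical helper: raw count of each token divided by the document length
    tokens = document.split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

example_tf = normalized_tf("I want to start learning to charge something in life")
print(example_tf["to"])    # "to" occurs 2 times out of 10 terms
print(example_tf["life"])  # "life" occurs 1 time out of 10 terms
```

Dividing by the document length keeps a long document from dominating simply because it repeats words more often.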


The normalized term frequencies for all the documents are given below.

In [None]:
#Normalized Term Frequency
def termFrequency(term, document):
    # Convert the document to lowercase and split into words
    normalizeDocument = document.lower().split()
    # Calculate the term frequency as number of occurrences of the term divided by total number of terms in document
    return normalizeDocument.count(term.lower()) / float(len(normalizeDocument))

def compute_normalizedtf(documents):
    # Initialize an empty list to store the term frequency dictionaries
    tf_doc = []
    # Iterate over each document in list of documents
    for txt in documents:
        # Split the document text into words
        sentence = txt.split()
        # Create a dictionary to keep track of normalized term frequency for unique words in the document
        norm_tf= dict.fromkeys(set(sentence), 0)
        # Calculate the normalized term frequency for each word and update the dictionary
        for word in sentence:
            norm_tf[word] = termFrequency(word, txt)
        # Append the dictionary to the list
        tf_doc.append(norm_tf)
        df = pd.DataFrame([norm_tf])
        idx = 0
        new_col = ["Normalized TF"]
        df.insert(loc=idx, column='Document', value=new_col)
        print(df)
    # Return the list of dictionaries containing the normalized term frequencies of each document
    return tf_doc

tf_doc = compute_normalizedtf([doc1, doc2, doc3])

# 2. Inverse Document Frequency (IDF)

* The main purpose of doing a search is to find out relevant documents matching the query.
* In Term Frequency, all terms are considered equally important. In fact, certain terms that occur too frequently have little power in determining relevance.
* We need a way to weigh down the effects of terms that occur too frequently. Conversely, terms that occur less often can be more relevant.
* We need a way to weigh up the effects of less frequently occurring terms. Logarithms help us solve this problem.


Let us compute the IDF for the term start.

$$\mathrm{IDF}(\text{start}) = 1 + \log_e\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the term start}}\right)$$

There are 3 documents in all: Document1, Document2, Document3.
The term start appears only in Document1.

$$\mathrm{IDF}(\text{start}) = 1 + \log_e(3 / 1) = 1 + 1.098612289 = 2.098612289$$
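
To see how the logarithm dampens the weight of common terms, here is a small sketch that evaluates the same formula for a term found in one, two, or all three of the documents. The function name `idf_from_counts` is an assumption made for this illustration only.

```python
import math

def idf_from_counts(total_docs, docs_with_term):
    # Hypothetical helper: 1 + natural log of (total documents / documents containing the term)
    return 1.0 + math.log(total_docs / docs_with_term)

# Evaluate the formula for a term present in 1, 2, or 3 of the 3 documents
for docs_with_term in (1, 2, 3):
    print(docs_with_term, round(idf_from_counts(3, docs_with_term), 9))
```

A term that appears in only one document gets the largest weight, while a term present in every document falls back to exactly 1.0, because log(1) is 0.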

In [None]:
def inverseDocumentFrequency(term, allDocuments):
    # Initialize a counter for number of documents containing term
    numDocumentsWithThisTerm = 0
    # Loop through each document in the list to check if it contains the term
    for doc in range (0, len(allDocuments)):
        # Convert the document to lowercase, split into words, check if the term is in document
        if term.lower() in allDocuments[doc].lower().split():
            numDocumentsWithThisTerm = numDocumentsWithThisTerm + 1
    # Calculate the IDF using the logarithm scale formula if term is in any document
    if numDocumentsWithThisTerm > 0:
        return 1.0 + math.log(float(len(allDocuments)) / numDocumentsWithThisTerm)
    else:
        # Return 1.0 if the term is not found in any document
        return 1.0

def compute_idf(documents):
    idf_dict = {}
    for doc in documents:
        # Split the document into words
        sentence = doc.split()
        # Compute IDF for each word and update dict
        for word in sentence:
            idf_dict[word] = inverseDocumentFrequency(word, documents)
    return idf_dict
idf_dict = compute_idf([doc1, doc2, doc3])

compute_idf([doc1, doc2, doc3])

# 3. TF * IDF

Remember we are trying to find out relevant documents for the query: **life learning**

* For each term in the query multiply its normalized term frequency with its IDF on each document.
* In Document1, for the term life the normalized term frequency is 0.1 and its IDF is 1.405465108.
* Multiplying them together we get 0.140546511 (0.1 * 1.405465108).
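
The multiplication above can be reproduced in a few lines. This is only a sketch: the variable names are illustrative, and the values are recomputed from the Document1 string rather than read from the notebook's dictionaries. It assumes that life occurs in two of the three documents, which is what the IDF of 1.405465108 implies.

```python
import math

doc1_tokens = "I want to start learning to charge something in life".split()

# Normalized TF of "life" in Document1: one occurrence out of ten terms
tf_life = doc1_tokens.count("life") / len(doc1_tokens)

# IDF of "life", assuming it occurs in 2 of the 3 documents
idf_life = 1.0 + math.log(3 / 2)

tfidf_life = tf_life * idf_life
print(round(tfidf_life, 9))
```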

The TF * IDF calculations for life and learning in all the documents are given below.

In [None]:
# TF-IDF scores across all documents for the query string ("life learning")
def compute_tfidf_with_alldocs(documents, query):
    tf_idf = []
    # Split the query into individual words
    query_tokens = query.split()
    # Create a DataFrame with a 'doc' column and one float column per query token,
    # initialised to 0 so that terms missing from a document keep a score of 0
    df = pd.DataFrame(0.0, index=np.arange(len(documents)), columns=['doc'] + query_tokens)
    df['doc'] = np.arange(len(documents))
    # Iterate over each document in the documents list
    for index, doc in enumerate(documents):
        # Retrieve the normalized term frequency dictionary for this document (global tf_doc)
        doc_num = tf_doc[index]
        # Loop through each word in the document
        for word in doc.split():
            if word in query_tokens:
                # Calculate the TF-IDF score: normalized term frequency times inverse document frequency
                tf_idf_score = doc_num[word] * idf_dict[word]
                tf_idf.append(tf_idf_score)
                # Store the TF-IDF score in the DataFrame
                df.loc[index, word] = tf_idf_score
    # Return the list of TF-IDF scores and the DataFrame of scores per document and term
    return tf_idf, df

documents = [doc1, doc2, doc3]
tf_idf , df = compute_tfidf_with_alldocs(documents , query)
print(df)

# 4. Vector Space Models and Representation – Cosine Similarity

The set of documents in a collection is then viewed as a set of vectors in a vector space, where each term has its own axis. Using the formulas given below, we can find the similarity between any two documents.

* > $\text{Cosine Similarity}(d_1, d_2) = \dfrac{\text{Dot product}(d_1, d_2)}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$
* > $\text{Dot product}(d_1, d_2) = d_1[0] \, d_2[0] + d_1[1] \, d_2[1] + \dots + d_1[n] \, d_2[n]$
* > $\lVert d_1 \rVert = \sqrt{d_1[0]^2 + d_1[1]^2 + \dots + d_1[n]^2}$
* > $\lVert d_2 \rVert = \sqrt{d_2[0]^2 + d_2[1]^2 + \dots + d_2[n]^2}$
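
The formulas above translate almost directly into code. Below is a minimal sketch using plain Python lists. The name `cosine_sim` is chosen for this illustration so that it does not collide with the notebook's own function, and returning 0.0 for an all-zero vector is an assumption, since the formula itself is undefined in that case.

```python
import math

def cosine_sim(d1, d2):
    # Dot product: sum of the element-wise products
    dot_product = sum(a * b for a, b in zip(d1, d2))
    # Euclidean norms: square root of the sum of squares
    norm_d1 = math.sqrt(sum(a * a for a in d1))
    norm_d2 = math.sqrt(sum(b * b for b in d2))
    if norm_d1 == 0.0 or norm_d2 == 0.0:
        # Assumption: treat an all-zero vector as having no similarity
        return 0.0
    return dot_product / (norm_d1 * norm_d2)

print(cosine_sim([1.0, 1.0], [2.0, 2.0]))  # vectors pointing in the same direction
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Because the score depends only on the angle between the vectors, scaling a vector does not change its similarity to another vector.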


In [None]:
from IPython.display import Image
Image("../input/tfidf-kernel/cosinesimilarity.jpg")

* Vectors deal only with numbers, but in this example we are dealing with text documents. This is why we used TF and IDF to convert text into numbers, so that each document can be represented by a vector.
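
One way to picture this conversion is to fix an order for the term axes and read each document's weights off in that order. The sketch below assumes a two-axis space taken from the query terms. The helper `to_vector` is illustrative, and the weight for Document3 is recomputed under the assumption that learning occurs in two of the three documents.

```python
import math

def to_vector(weights, axes):
    # Illustrative helper: read weights in a fixed axis order, using 0.0 for absent terms
    return [weights.get(term, 0.0) for term in axes]

axes = ["life", "learning"]  # one axis per query term

# TF-IDF weight of "learning" in "Never stop learning": TF = 1/3, IDF = 1 + ln(3/2)
doc3_weights = {"learning": (1 / 3) * (1.0 + math.log(3 / 2))}

doc3_vector = to_vector(doc3_weights, axes)
print(doc3_vector)
```

Terms that a document does not contain simply contribute a 0 on their axis.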


The query entered by the user can also be represented as a vector. We will calculate the TF*IDF for the query.

In [None]:
#Normalized TF for the query string("life learning")
def compute_query_tf(query):
    query_norm_tf = {}
    tokens = query.split()
    for word in tokens:
        query_norm_tf[word] = termFrequency(word , query)
    return query_norm_tf
query_norm_tf = compute_query_tf(query)
print(query_norm_tf)

In [None]:
#idf score for the query string("life learning")
def compute_query_idf(query):
    idf_dict_qry = {}
    # Split the query into individual words
    sentence = query.split()
    documents = [doc1, doc2, doc3]
    # Loop over each token in query
    for word in sentence:
        # Calculate the inverse document frequency of the word
        idf_dict_qry[word] = inverseDocumentFrequency(word ,documents)
    # Return the dictionary containing the IDF of each query term
    return idf_dict_qry
idf_dict_qry = compute_query_idf(query)
print(idf_dict_qry)

In [None]:
#tf-idf score for the query string("life learning")
def compute_query_tfidf(query):
    tfidf_dict_qry = {}
    sentence = query.split()
    # Loop over each token in query
    for word in sentence:
        # Calculate the TF-IDF score for each word by multiplying its normalized term
        # frequency (TF) by its inverse document frequency (IDF)
        tfidf_dict_qry[word] = query_norm_tf[word] * idf_dict_qry[word]
    return tfidf_dict_qry
tfidf_dict_qry = compute_query_tfidf(query)
print(tfidf_dict_qry)

Let us now calculate the cosine similarity of the query and Document1.

$$\text{Cosine Similarity}(\text{Query}, \text{Document1}) = \frac{\text{Dot product}(\text{Query}, \text{Document1})}{\lVert \text{Query} \rVert \, \lVert \text{Document1} \rVert}$$

$$\text{Dot product}(\text{Query}, \text{Document1}) = (0.702732554)(0.140546511) + (0.702732554)(0.140546511) = 0.197533217$$

$$\lVert \text{Query} \rVert = \sqrt{0.702732554^2 + 0.702732554^2} = 0.993813909$$

$$\lVert \text{Document1} \rVert = \sqrt{0.140546511^2 + 0.140546511^2} = 0.198762782$$

$$\text{Cosine Similarity}(\text{Query}, \text{Document1}) = \frac{0.197533217}{0.993813909 \times 0.198762782} = \frac{0.197533217}{0.197533217} = 1$$
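
The same arithmetic can be repeated for every document. The sketch below rebuilds the two-dimensional (life, learning) vectors from scratch, assuming the three document strings used in the code cells and the 1 + log_e IDF defined earlier. All function and variable names are illustrative and independent of the notebook's own functions.

```python
import math

docs = [
    "I want to start learning to charge something in life",
    "reading something about life no one else knows",
    "Never stop learning",
]
query_text = "life learning"
axes = query_text.split()  # one axis per query term

def tf(term, text):
    # Normalized term frequency: occurrences divided by the number of tokens
    tokens = text.lower().split()
    return tokens.count(term.lower()) / len(tokens)

def idf(term, texts):
    # 1 + ln(total documents / documents containing the term); 1.0 if the term is absent
    containing = sum(1 for text in texts if term.lower() in text.lower().split())
    return 1.0 + math.log(len(texts) / containing) if containing else 1.0

def tfidf_vector(text):
    # TF-IDF weight of each query term, in axis order
    return [tf(term, text) * idf(term, docs) for term in axes]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

query_vector = tfidf_vector(query_text)
scores = [cosine(query_vector, tfidf_vector(doc)) for doc in docs]
print([round(score, 6) for score in scores])
```

Under these assumptions Document1 scores 1, while the other two documents, which each contain only one of the query terms, score about 0.707.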

In [None]:
# Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)

"""
Example: cosine similarity of the query and Document1

     Dot product(Query, Document1)
     = tfidf(life w.r.t. query) * tfidf(life w.r.t. Document1)
     + tfidf(learning w.r.t. query) * tfidf(learning w.r.t. Document1)

     ||Query||
     = sqrt(tfidf(life w.r.t. query)^2 + tfidf(learning w.r.t. query)^2)

     ||Document1||
     = sqrt(tfidf(life w.r.t. Document1)^2 + tfidf(learning w.r.t. Document1)^2)

"""
def cosine_similarity(tfidf_dict_qry, df , query , doc_num):
    # Initialize variables to calculate dot product and magnitudes
    dot_product = 0
    qry_mod = 0
    doc_mod = 0
    # Split the query into individual words
    tokens = query.split()
    # Calculate dot product and magnitudes for cosine similarity calculation
    for keyword in tokens:
        # Increment dot product by product of TF-IDF scores
        dot_product += tfidf_dict_qry[keyword] * df[keyword][df['doc'] == doc_num]
        #||Query||
        qry_mod += tfidf_dict_qry[keyword] * tfidf_dict_qry[keyword]
        #||Document||
        doc_mod += df[keyword][df['doc'] == doc_num] * df[keyword][df['doc'] == doc_num]
    # Take the square roots of the sums of squares to get the Euclidean norms of the query and the document
    qry_mod = np.sqrt(qry_mod)
    doc_mod = np.sqrt(doc_mod)
    #implement formula
    denominator = qry_mod * doc_mod
    # Calculate cosine similarity as ratio of dot product to magnitudes
    cos_sim = dot_product/denominator

    return cos_sim

from collections.abc import Iterable
def flatten(lis):
     for item in lis:
        # Check if item is iterable and not a string, to avoid splitting strings into characters
        if isinstance(item, Iterable) and not isinstance(item, str):
             # If the item is an iterable recursively flatten
             for x in flatten(item):
                yield x
        else:
             # If not yield it directly
             yield item


In [None]:
def rank_similarity_docs(data):
    cos_sim =[]
    for doc_num in range(0 , len(data)):
        # Calculate cosine similarity for current document and query
        # append the result to cos_sim list
        cos_sim.append(cosine_similarity(tfidf_dict_qry, df , query , doc_num).tolist())
    # Return list of cosine similarity scores
    return cos_sim
similarity_docs = rank_similarity_docs(documents)
doc_names = ["Document1", "Document2", "Document3"]
print(doc_names)
print(list(flatten(similarity_docs)))

* I plotted the vectors for the query and the documents in the 2-dimensional space of life and learning. Document1 has the highest score of 1. This is not surprising, as it contains both of the terms life and learning.

In [None]:
from IPython.display import Image
Image("../input/tfidf-kernel/cosinesimiarlity11.jpeg")