# TF-IDF

The TF-IDF model is one such method for representing words as numerical values. TF-IDF stands for “Term Frequency – Inverse Document Frequency”.


This method addresses a drawback of the bag-of-words model: it does not assign equal value to all words, so important words that occur only a few times can still be assigned high weights.

TF-IDF is the product of Term Frequency and Inverse Document Frequency. Here’s the formula for TF-IDF calculation.

`TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)`


**What is Term Frequency?**

Term frequency measures how often a word occurs in a document. It is the ratio of the number of times the word appears in a document to the total number of words in that document.

`tf(t,d) = count of t in d / number of words in d`
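
To make the ratio concrete, here is a small worked example. The six-token document and the helper name `toy_term_freq` are made up for illustration and are not part of the tutorial corpus.

```python
# Hypothetical toy document, already tokenized and lowercased.
toy_document = ["the", "topic", "sentence", "states", "the", "point"]

def toy_term_freq(document, word):
    """Return the count of `word` in `document` divided by the document length."""
    return document.count(word) / len(document)

tf_the = toy_term_freq(toy_document, "the")      # 2 occurrences / 6 tokens
tf_topic = toy_term_freq(toy_document, "topic")  # 1 occurrence / 6 tokens
print(tf_the, tf_topic)
```

A word that appears twice in a six-word document gets twice the term frequency of a word that appears once, regardless of how common either word is in the rest of the corpus.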


**What is Inverse Document Frequency?**

Words that occur rarely in the corpus have a high IDF score. IDF is the log of the ratio of the total number of documents to the number of documents containing the word.

We take the log of this ratio because, as the corpus grows, the raw ratio can become very large; taking the log dampens this effect.

Because we cannot divide by 0, we smooth the value by adding 1 to the denominator.

`idf(t) = log(N/(df + 1))`
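
The following sketch evaluates this smoothed formula for a three-document corpus, the same size as the example corpus used in this tutorial. The helper name `toy_idf` is an assumption made for illustration. It also shows a side effect of adding 1 to the denominator: a word that appears in every document gets a slightly negative IDF, because N/(N + 1) is less than 1.

```python
import math

def toy_idf(total_docs, doc_freq):
    """Smoothed IDF as defined above: log(N / (df + 1))."""
    return math.log(total_docs / (doc_freq + 1))

N = 3  # number of documents in the corpus

idf_rare = toy_idf(N, 1)        # log(3/2): word found in one document, positive
idf_common = toy_idf(N, 2)      # log(3/3): word found in two documents, zero
idf_everywhere = toy_idf(N, 3)  # log(3/4): word found in all documents, negative

print(idf_rare, idf_common, idf_everywhere)
```

This is why TF-IDF values computed with this formula can be negative for words that occur in all documents.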

### Preprocess the data

We’ll start by preprocessing the text data, building a vocabulary set of the words in our training data, and assigning a unique index to each word in the set.

In [1]:
import nltk
from nltk.tokenize import word_tokenize

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
# Example text corpus for our tutorial

text = ['Topic sentences are similar to mini thesis statements.\
        Like a thesis statement, a topic sentence has a specific \
        main point. Whereas the thesis is the main point of the essay',\
        'the topic sentence is the main point of the paragraph.\
        Like the thesis statement, a topic sentence has a unifying function. \
        But a thesis statement or topic sentence alone doesn’t guarantee unity.', \
        'An essay is unified if all the paragraphs relate to the thesis,\
        whereas a paragraph is unified if all the sentences relate to the topic sentence.']

In [4]:
#Preprocessing the text data
sentences = []
word_set = []

In [5]:
for sent in text:
    x = [i.lower() for i in word_tokenize(sent) if i.isalpha()]
    sentences.append(x)
    for word in x:
        if word not in word_set:
            word_set.append(word)

In [6]:
#Set of vocab
word_set = set(word_set)

In [7]:
#Total documents in our corpus
total_documents = len(sentences)

In [8]:
#Creating an index for each word in our vocab.
index_dict = {} #Dictionary to store index for each word
i = 0
for word in word_set:
    index_dict[word] = i
    i += 1
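
Because `word_set` is a Python set, its iteration order is not guaranteed to be the same from one run to the next, so the index assigned to a given word may differ between runs. If reproducible indices matter, one option is to sort the vocabulary before numbering it. The sketch below uses a small hypothetical vocabulary (`toy_word_set`) in place of the real one.

```python
# Hypothetical small vocabulary standing in for `word_set`.
toy_word_set = {"thesis", "topic", "sentence", "essay"}

# Sorting first makes the word-to-index mapping reproducible across runs.
toy_index_dict = {word: i for i, word in enumerate(sorted(toy_word_set))}

print(toy_index_dict)
# {'essay': 0, 'sentence': 1, 'thesis': 2, 'topic': 3}
```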

### Create a dictionary for keeping count


We then create a dictionary to keep the count of the number of documents containing the given word.

In [9]:
#Create a count dictionary

def count_dict(sentences):
    word_count = {}
    for word in word_set:
        word_count[word] = 0
        for sent in sentences:
            if word in sent:
                word_count[word] += 1
    return word_count

word_count = count_dict(sentences)

In [10]:
word_count

{'sentence': 3,
 'similar': 1,
 'the': 3,
 'to': 2,
 'are': 1,
 'thesis': 3,
 'all': 1,
 'topic': 3,
 'but': 1,
 'unified': 1,
 'mini': 1,
 'unifying': 1,
 'or': 1,
 'whereas': 2,
 'alone': 1,
 'paragraphs': 1,
 'statement': 2,
 'paragraph': 2,
 'relate': 1,
 'statements': 1,
 'if': 1,
 'main': 2,
 'an': 1,
 'point': 2,
 'like': 2,
 'unity': 1,
 'function': 1,
 'essay': 2,
 't': 1,
 'specific': 1,
 'of': 2,
 'sentences': 2,
 'is': 3,
 'a': 3,
 'doesn': 1,
 'guarantee': 1,
 'has': 2}
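
The same document-frequency count can be sketched more compactly with `collections.Counter`. This is an alternative under stated assumptions, not the method used in the rest of this tutorial, and `toy_sentences` is a made-up stand-in for `sentences`.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for `sentences`.
toy_sentences = [
    ["topic", "sentence", "topic"],
    ["thesis", "sentence"],
    ["essay", "thesis", "sentence"],
]

# Converting each document to a set means a repeated word is counted
# once per document, which is what document frequency requires.
toy_doc_freq = Counter(word for sent in toy_sentences for word in set(sent))

print(toy_doc_freq["sentence"], toy_doc_freq["thesis"], toy_doc_freq["topic"])
# 3 2 1
```

Note that `topic` appears twice in the first toy document but still has a document frequency of 1.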

### Define a function to calculate Term Frequency

First, define a function to compute the term frequency (TF).

In [11]:
#Term Frequency
def term_freq(document, word):
    N = len(document)
    occurrence = len([token for token in document if token == word])
    return occurrence/N

### Define a function to calculate Inverse Document Frequency

Define another function for the Inverse Document Frequency (IDF).


In [12]:
import numpy as np

In [13]:
#Inverse Document Frequency
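def inverse_doc_freq(word):
    try:
        # Add 1 to the document frequency to avoid division by zero
        word_occurrence = word_count[word] + 1
    except KeyError:
        # Word not in the vocabulary: treat its document frequency as 0
        word_occurrence = 1
    return np.log(total_documents/word_occurrence)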

### Combining the TF-IDF functions

In [14]:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = term_freq(sentence,word)
        idf = inverse_doc_freq(word)

        value = tf*idf
        tf_idf_vec[index_dict[word]] = value
    return tf_idf_vec

### Apply the TF-IDF Model to our text

The implementation of the TF-IDF model in Python is complete. Now, let’s pass the text corpus to the function and see what the output vector looks like.

In [15]:
#TF-IDF Encoded text corpus
vectors = []
for sent in sentences:
    vec = tf_idf(sent)
    vectors.append(vec)
index_of_thesis = index_dict['thesis']
print(index_of_thesis)

print(vectors[0][index_of_thesis])
print(vectors[1][index_of_thesis])
print(vectors[2][index_of_thesis])

5
-0.02876820724517809
-0.017435277118289752
-0.011064695094299266
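
The first printed value can be reproduced by hand, which also explains why it is negative. Counting tokens in the first document after preprocessing gives 30 alphabetic tokens, 3 of which are `thesis`, and `thesis` appears in all 3 documents. The variable names below are illustrative only.

```python
import math

# Term frequency of 'thesis' in the first document: 3 occurrences / 30 tokens.
tf_thesis_doc1 = 3 / 30

# Smoothed IDF: 3 documents in total, 'thesis' occurs in all 3,
# so the ratio is 3 / (3 + 1) and its log is negative.
idf_thesis = math.log(3 / (3 + 1))

tfidf_thesis_doc1 = tf_thesis_doc1 * idf_thesis
print(tfidf_thesis_doc1)  # approximately -0.0288
```

All three printed values share the same negative IDF and differ only in the term frequency of `thesis` in each document.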


In [16]:
vectors

[array([-0.0095894 ,  0.0135155 , -0.02876821,  0.        ,  0.0135155 ,
        -0.02876821,  0.        , -0.0191788 ,  0.        ,  0.        ,
         0.0135155 ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.0135155 ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.0135155 ,
         0.        ,  0.        , -0.0095894 , -0.02876821,  0.        ,
         0.        ,  0.        ]),
 array([-0.02615292,  0.        , -0.03487055,  0.        ,  0.        ,
        -0.01743528,  0.        , -0.02615292,  0.01228682,  0.        ,
         0.        ,  0.01228682,  0.01228682,  0.        ,  0.01228682,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.01228682,  0.01228682,  0.        ,  0.01228682,  0.        ,
         0.    

## Use in classification: Euclidean distance

Euclidean distance is the straight-line (shortest) distance between two points, regardless of the number of dimensions.

`dist = np.linalg.norm(point1 - point2)`

In [17]:
dist_doc1_doc2 = np.linalg.norm(vectors[0] - vectors[1])
dist_doc1_doc3 = np.linalg.norm(vectors[0] - vectors[2])
dist_doc2_doc3 = np.linalg.norm(vectors[1] - vectors[2])

print(dist_doc1_doc2 , dist_doc1_doc3, dist_doc2_doc3)

0.05261463562637202 0.07989345472253513 0.08202330182165479
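
One way such distances might be used for classification is a nearest-neighbour rule: assign a new document the label of the closest labelled vector. The sketch below is a minimal illustration under that assumption. The vectors, the labels, and the helper `nearest_label` are all hypothetical, and `math.dist` is used in place of `np.linalg.norm` only to keep the example dependency-free.

```python
import math

# Hypothetical labelled TF-IDF vectors (values and labels are made up).
labelled_vectors = [
    ([0.10, 0.00, 0.02], "essay-writing"),
    ([0.00, 0.12, 0.01], "cooking"),
]

def nearest_label(query, labelled):
    """Return the label of the vector closest to `query` in Euclidean distance."""
    # math.dist(a, b) computes the same quantity as np.linalg.norm(a - b).
    _, label = min(labelled, key=lambda pair: math.dist(query, pair[0]))
    return label

predicted = nearest_label([0.09, 0.01, 0.02], labelled_vectors)
print(predicted)  # essay-writing
```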
