# An introduction to TF-IDF
TF-IDF stands for **Term Frequency-Inverse Document Frequency**.



**Term Frequency (tf)**: measures how often a word occurs in a document. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the word occurs more often within the document. Each document has its own tf value for each word.


![Term frequency formula](./image/tfidf1.png)
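
As a minimal sketch of the definition above, term frequency can be computed by counting tokens and dividing by the document length. The function name `term_frequency` and the whitespace tokenization are assumptions made for illustration; they are not part of the original walkthrough.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {word: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Hypothetical example document, split on whitespace.
tf = term_frequency("the car is driven on the road".split())
print(tf["the"])  # "the" occurs 2 times out of 7 tokens
print(tf["car"])  # "car" occurs 1 time out of 7 tokens
```

Because every count is divided by the same total, the tf values of a document sum to 1.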

**Inverse Document Frequency (idf):** measures how rare a word is across all documents in the corpus. Words that occur in few documents have a high idf score. It is given by the equation below.

![Inverse document frequency formula](./image/tfidf2.png)
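
The idea can be sketched in a few lines of Python. The function name `inverse_document_frequency` and the two-document corpus are hypothetical, and the base-10 logarithm is an assumption chosen to match the `math.log10` call used in the code cells of this walkthrough.

```python
import math

def inverse_document_frequency(documents):
    """Return {word: log10(N / number of documents containing the word)}."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() so a word is counted once per document, however often it repeats
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log10(n_docs / df) for word, df in doc_freq.items()}

# Hypothetical two-document corpus, split on whitespace.
corpus = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
]
idf = inverse_document_frequency(corpus)
print(idf["the"])  # appears in both documents: log10(2 / 2) = 0.0
print(idf["car"])  # appears in one document: log10(2 / 1), about 0.301
```

A word that appears in every document gets an idf of zero, which is why very common words contribute nothing to the final score.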

Combining these two gives the TF-IDF score (w) of a word in a document in the corpus. It is the product of tf and idf:



![TF-IDF score formula](./image/tfidf3.png)
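
Written out, and assuming the figures show the standard formulation, the three quantities are as follows. The symbols are assumptions introduced here: $n_{i,j}$ is the number of times word $i$ occurs in document $j$, $N$ is the number of documents in the corpus, and $\mathrm{df}_i$ is the number of documents that contain word $i$. The base-10 logarithm matches the `math.log10` call in the code cells of this walkthrough.

```latex
\begin{aligned}
\mathrm{tf}_{i,j} &= \frac{n_{i,j}}{\sum_{k} n_{k,j}} \\
\mathrm{idf}_{i}  &= \log_{10}\frac{N}{\mathrm{df}_{i}} \\
w_{i,j}           &= \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}
\end{aligned}
```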

Let’s take an example to get a clearer understanding.

Sentence 1: The car is driven on the road.

Sentence 2: The truck is driven on the highway.

In this example, each sentence is a separate document.

We will now calculate the TF-IDF for the above two documents, which represent our corpus.

![TF-IDF calculation for the two example sentences](./image/tfidf4.png)
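
The same calculation can be reproduced with a short script. This is a sketch under stated assumptions: the text is lowercased and the final period is dropped before splitting on whitespace, the logarithm is base 10, and all variable names are chosen here for illustration.

```python
import math
from collections import Counter

# Assumption: lowercase the text and drop the final period before splitting.
docs = {
    "sentence 1": "The car is driven on the road.",
    "sentence 2": "The truck is driven on the highway.",
}
tokens = {name: text.lower().rstrip(".").split() for name, text in docs.items()}

n_docs = len(tokens)
vocab = sorted(set().union(*tokens.values()))
# Number of documents that contain each word.
doc_freq = {w: sum(w in toks for toks in tokens.values()) for w in vocab}

tfidf = {}
for name, toks in tokens.items():
    counts = Counter(toks)
    tfidf[name] = {
        w: (counts[w] / len(toks)) * math.log10(n_docs / doc_freq[w])
        for w in vocab
    }

# One row per word: score in sentence 1, then score in sentence 2.
for w in vocab:
    print(f"{w:8s} {tfidf['sentence 1'][w]:.3f} {tfidf['sentence 2'][w]:.3f}")
```

Only the words unique to one sentence (car, road, truck, highway) receive a non-zero score, about 0.043 each. Words shared by both sentences have idf = log10(2/2) = 0, so their TF-IDF is 0 regardless of how often they occur.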

In [1]:
docA = "The cat sat on my lap"
docB = "The dog sat on my bed"

In [2]:
bowA = docA.split(" ")
bowB = docB.split(" ")

In [3]:
bowB

['The', 'dog', 'sat', 'on', 'my', 'bed']

In [4]:
wordSet = set(bowA).union(set(bowB))

In [5]:
wordSet

{'The', 'bed', 'cat', 'dog', 'lap', 'my', 'on', 'sat'}

In [5]:
wordDictA = dict.fromkeys(wordSet, 0) 
wordDictB = dict.fromkeys(wordSet, 0)

In [6]:
wordDictA

{'lap': 0, 'The': 0, 'dog': 0, 'my': 0, 'sat': 0, 'bed': 0, 'cat': 0, 'on': 0}

In [7]:
for word in bowA:
    wordDictA[word]+=1
    
for word in bowB:
    wordDictB[word]+=1

In [8]:
wordDictA

{'lap': 1, 'The': 1, 'dog': 0, 'my': 1, 'sat': 1, 'bed': 0, 'cat': 1, 'on': 1}

In [9]:
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])


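|   | The | bed | cat | dog | lap | my | on | sat |
|---|-----|-----|-----|-----|-----|----|----|-----|
| 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |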


In [10]:
def computeTF(wordDict, bow):
    # Term frequency: the count of each word divided by the
    # total number of words in the document.
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

In [11]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

In [12]:
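import math

def computeIDF(docList):
    N = len(docList)  # total number of documents in the corpus

    # Document frequency: the number of documents in which each word appears.
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # idf = log10(N / document frequency). In this notebook every word in the
    # vocabulary appears in at least one document, so the divisor is never zero.
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict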

In [13]:
idfs = computeIDF([wordDictA, wordDictB])

In [14]:
def computeTFIDF(tfBow, idfs):
    # TF-IDF score of each word: its tf in this document times its idf in the corpus.
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [15]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [16]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

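|   | The | bed | cat | dog | lap | my | on | sat |
|---|-----|-----|-----|-----|-----|----|----|-----|
| 0 | 0.0 | 0.0 | 0.050172 | 0.0 | 0.050172 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.050172 | 0.0 | 0.050172 | 0.0 | 0.0 | 0.0 | 0.0 |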
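
The values in this table can be checked by hand. Each document has six words, so a word that occurs once has tf = 1/6. A word found in only one of the two documents has idf = log10(2/1), while a word found in both has idf = log10(2/2) = 0. The snippet below is a small sanity check of that arithmetic; the variable names are chosen here for illustration.

```python
import math

tf_once = 1 / 6                 # one occurrence in a six-word document
idf_unique = math.log10(2 / 1)  # word appears in one of the two documents
idf_shared = math.log10(2 / 2)  # word appears in both documents

print(round(tf_once * idf_unique, 6))  # 0.050172
print(tf_once * idf_shared)            # 0.0
```

This is why "cat" and "lap" score 0.050172 in the first document and "dog" and "bed" in the second, while the four words shared by both documents score 0.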
