# TF-IDF Explanation



    TF: Term Frequency, which measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. The raw count is therefore often divided by the document length (the total number of terms in the document) as a form of normalization:

    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

    IDF: Inverse Document Frequency, which measures how important a term is. When computing TF, all terms are treated as equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. We therefore need to weigh down the frequent terms and scale up the rare ones, by computing the following:

    IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
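
To make the two formulas concrete, here is a minimal sketch that evaluates them directly on a two-document toy corpus. The names `corpus`, `tf`, and `idf` are illustrative assumptions and are not the functions defined later in this notebook.

```python
import math

# Toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on my sofa".split(),
    "the dog sat on my bed".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency using the natural log.

    Assumes `term` occurs in at least one document; otherwise the
    division below raises ZeroDivisionError.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

print(tf("cat", corpus[0]))  # "cat" is 1 of the 6 tokens in the first document
print(idf("cat", corpus))    # log(2 / 1): "cat" is in one of two documents
print(idf("the", corpus))    # log(2 / 2): "the" is in every document
```

A term that occurs in every document gets an IDF of zero, which is how common words such as "the" are weighed down.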


In [1]:
docA = "the cat sat on my sofa"
docB = "the dog sat on my bed" 

In [2]:
bowA = docA.split(" ")
bowB = docB.split(" ")
bowA

['the', 'cat', 'sat', 'on', 'my', 'sofa']

In [3]:
set(bowA)

{'cat', 'my', 'on', 'sat', 'sofa', 'the'}

In [5]:
# Vocabulary in the corpus
wordSet = set(bowA).union(set(bowB))
wordSet

{'bed', 'cat', 'dog', 'my', 'on', 'sat', 'sofa', 'the'}

In [6]:
#Dictionaries to keep the word count in each bag of words
wordDictA = dict.fromkeys(wordSet,0)
wordDictB = dict.fromkeys(wordSet,0)

In [7]:
wordDictA

{'bed': 0, 'cat': 0, 'dog': 0, 'my': 0, 'on': 0, 'sat': 0, 'sofa': 0, 'the': 0}

In [8]:
# count the frequency of each word in the dictionary
for word in bowA:
    wordDictA[word]+=1
    
for word in bowB:
    wordDictB[word]+=1

In [11]:
print(wordDictA)
print(wordDictB)

{'the': 1, 'dog': 0, 'on': 1, 'cat': 1, 'my': 1, 'sofa': 1, 'sat': 1, 'bed': 0}
{'the': 1, 'dog': 1, 'on': 1, 'cat': 0, 'my': 1, 'sofa': 0, 'sat': 1, 'bed': 1}
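
As an aside, the same counts can be produced with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the approach used in the cells above; `count_words`, `bow_a`, `bow_b`, and `vocab` are hypothetical names.

```python
from collections import Counter

bow_a = "the cat sat on my sofa".split()
bow_b = "the dog sat on my bed".split()

# Shared vocabulary, sorted so the key order is deterministic.
vocab = sorted(set(bow_a) | set(bow_b))

def count_words(bow, vocab):
    """Return a dict with one entry per vocabulary word."""
    counts = Counter(bow)
    # A Counter returns 0 for keys it has never seen, so words that are
    # absent from this document still get an explicit zero entry.
    return {word: counts[word] for word in vocab}

word_dict_a = count_words(bow_a, vocab)
word_dict_b = count_words(bow_b, vocab)
print(word_dict_a)
print(word_dict_b)
```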


In [13]:
import pandas as pd
#Put them into a matrix
pd.DataFrame([wordDictA,wordDictB])

   bed  cat  dog  my  on  sat  sofa  the
0    0    1    0   1   1    1     1    1
1    1    0    1   1   1    1     0    1


## Instead of relying on term frequency alone to find the important words, we use each word's TF-IDF score to rank its importance.

In [17]:
def computeTF(wordDict, bow):
    # Term frequency: each word's count divided by the total number of words in the document
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

In [19]:
tfbowA = computeTF(wordDictA,bowA)
tfbowA

{'bed': 0.0,
 'cat': 0.16666666666666666,
 'dog': 0.0,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666,
 'sofa': 0.16666666666666666,
 'the': 0.16666666666666666}

In [25]:
tfbowB = computeTF(wordDictB,bowB)
tfbowB

{'bed': 0.16666666666666666,
 'cat': 0.0,
 'dog': 0.16666666666666666,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666,
 'sofa': 0.0,
 'the': 0.16666666666666666}
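
Every word in the two example documents occurs exactly once, so all non-zero TF values above are 1/6. A small sketch with a repeated word may make the normalization easier to see; the sentence and the helper `term_frequencies` are made up for illustration.

```python
from collections import Counter

def term_frequencies(tokens):
    """Map each distinct token to its count divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Hypothetical document in which "the" appears twice.
tokens = "the cat chased the mouse".split()
tf = term_frequencies(tokens)

print(tf["the"])  # 2 of 5 tokens
print(tf["cat"])  # 1 of 5 tokens
```

Because each count is divided by the same document length, the TF values of a single document sum to 1.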

In [20]:
import math
def computeIDF(docList):
    N = len(docList)

    # Count the number of documents that contain each word.
    # Assumes every document dict shares the same keys (the corpus vocabulary).
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # Divide N by the document count above and take the natural log.
    # Assumes each word occurs in at least one document, so val is never 0.
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))

    return idfDict

In [22]:
idfs = computeIDF([wordDictA,wordDictB])
idfs

{'bed': 0.6931471805599453,
 'cat': 0.6931471805599453,
 'dog': 0.6931471805599453,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'sofa': 0.6931471805599453,
 'the': 0.0}
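
`computeIDF` divides N by each word's document count, so it relies on every vocabulary word appearing in at least one document. That holds here because the vocabulary is built from the documents themselves. If the vocabulary came from elsewhere, the division could fail. Below is a minimal sketch of a guarded variant; the name `compute_idf_safe` is hypothetical, and assigning 0.0 to unseen words is an arbitrary choice made for illustration, not a standard definition.

```python
import math

def compute_idf_safe(doc_list):
    """IDF over a list of word-count dicts, tolerating words with no occurrences."""
    n_docs = len(doc_list)

    # Collect the vocabulary from every document, not only the first one.
    vocab = set()
    for doc in doc_list:
        vocab.update(doc)

    # Number of documents in which each word has a positive count.
    doc_freq = dict.fromkeys(vocab, 0)
    for doc in doc_list:
        for word, count in doc.items():
            if count > 0:
                doc_freq[word] += 1

    # Assumption: a word that occurs in no document gets an IDF of 0.0
    # instead of triggering a division by zero.
    return {
        word: math.log(n_docs / df) if df > 0 else 0.0
        for word, df in doc_freq.items()
    }

# "fish" is in the vocabulary but occurs in neither document.
counts_a = {"cat": 1, "the": 1, "fish": 0}
counts_b = {"cat": 0, "the": 1, "fish": 0}
safe_idfs = compute_idf_safe([counts_a, counts_b])
print(safe_idfs)
```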

In [23]:
def computeTFIDF(tfBow,idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [24]:
tfIDFA = computeTFIDF(tfbowA,idfs)
tfIDFA

{'bed': 0.0,
 'cat': 0.11552453009332421,
 'dog': 0.0,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'sofa': 0.11552453009332421,
 'the': 0.0}

In [26]:
tfIDFB = computeTFIDF(tfbowB, idfs)
tfIDFB

{'bed': 0.11552453009332421,
 'cat': 0.0,
 'dog': 0.11552453009332421,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'sofa': 0.0,
 'the': 0.0}

In [27]:
pd.DataFrame([tfIDFA,tfIDFB])

        bed       cat       dog   my   on  sat      sofa  the
0       0.0  0.115525       0.0  0.0  0.0  0.0  0.115525  0.0
1  0.115525       0.0  0.115525  0.0  0.0  0.0       0.0  0.0
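
Since the goal is to rank words by importance, the scores can be sorted to pick out the top terms of each document. The sketch below copies the TF-IDF values printed above into two dicts; `top_terms`, `tfidf_a`, and `tfidf_b` are hypothetical names introduced for illustration.

```python
def top_terms(tfidf, k=2):
    """Return up to k (word, score) pairs with the highest non-zero scores."""
    # Sort by descending score, breaking ties alphabetically for a stable order.
    ranked = sorted(tfidf.items(), key=lambda item: (-item[1], item[0]))
    return [(word, score) for word, score in ranked[:k] if score > 0]

# TF-IDF scores copied from the outputs above.
tfidf_a = {'bed': 0.0, 'cat': 0.11552453009332421, 'dog': 0.0, 'my': 0.0,
           'on': 0.0, 'sat': 0.0, 'sofa': 0.11552453009332421, 'the': 0.0}
tfidf_b = {'bed': 0.11552453009332421, 'cat': 0.0, 'dog': 0.11552453009332421,
           'my': 0.0, 'on': 0.0, 'sat': 0.0, 'sofa': 0.0, 'the': 0.0}

print(top_terms(tfidf_a))  # the words unique to document A: cat, sofa
print(top_terms(tfidf_b))  # the words unique to document B: bed, dog
```

The words shared by both documents ("the", "sat", "on", "my") score zero, so only the words that distinguish each document remain.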
