Computers handle numbers well, but they cannot work with raw text directly, so text first has to be converted into numeric features. One of the most widely used techniques for doing this is TF-IDF.

### Term Frequency Inverse Document Frequency
Term Frequency: This summarizes how often a given word appears within a document.<br>
Inverse Document Frequency: This downscales words that appear a lot across documents.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Advantage of TF-IDF: stop words do not have to be removed explicitly, because their weights are reduced.

IDF: Inverse Document Frequency, which measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little meaning. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

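IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

The base of the logarithm is a convention: changing it only rescales every IDF value by the same constant factor.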
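To make the effect of the formula concrete, here is a minimal sketch that applies it to a small made-up corpus. The documents and the helper name `idf` are illustrative assumptions, not part of the original notebook, and the helper assumes the term occurs in at least one document (otherwise the division fails).

```python
import math

# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    ["the", "cat", "is", "on", "the", "mat"],
    ["the", "dog", "is", "in", "the", "yard"],
    ["the", "bird", "is", "in", "the", "tree"],
    ["a", "cat", "chased", "the", "bird"],
]

def idf(term, docs):
    """IDF(t) = log_e(total documents / documents containing t)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

# "the" occurs in every document, "dog" in only one.
for term in ["the", "is", "cat", "dog"]:
    print(term, round(idf(term, docs), 3))
```

A term that occurs in every document gets an IDF of 0, and the value grows as the term becomes rarer, which is the down-weighting of frequent terms described above.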

See below for a simple example.

Example:

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in one thousand of them. Using a base-10 logarithm, the inverse document frequency (idf) is log10(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm from the formula above, the idf would be about 9.21 and the weight about 0.28.)
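The arithmetic of this example can be checked in a few lines. This is only a sketch of the calculation above; the variable names are chosen here for illustration.

```python
import math

tf = 3 / 100                            # "cat" appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)    # base-10 logarithm, which is what yields 4
tfidf = tf * idf

print(tf, idf, round(tfidf, 2))         # 0.03 4.0 0.12

# The same ratio with the natural logarithm gives a larger IDF (about 9.21).
idf_ln = math.log(10_000_000 / 1_000)
print(round(idf_ln, 2))
```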

### Preliminaries


In [2]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

### Create Text Data


In [44]:
# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

### Create Feature Matrix


In [45]:
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

# Show tf-idf feature matrix
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])
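These values differ from what the plain TF * IDF formula would give because `TfidfVectorizer` by default uses raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces the matrix in pure Python under those assumptions; the regular expression approximates the vectorizer's default tokenization (lowercased tokens of two or more word characters, so "I" is dropped), and all variable names are illustrative.

```python
import math
import re
from collections import Counter

docs = ["I love Brazil. Brazil!", "Sweden is best", "Germany beats both"]

# Lowercase and keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
n_docs = len(docs)

# Document frequency, then smoothed IDF: ln((1 + n) / (1 + df)) + 1
df = Counter(t for toks in tokenized for t in set(toks))
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

rows = []
for toks in tokenized:
    counts = Counter(toks)
    raw = [counts[t] * idf[t] for t in vocab]    # raw count * idf
    norm = math.sqrt(sum(v * v for v in raw))    # L2 norm of the row
    rows.append([v / norm for v in raw])

print(vocab)
for row in rows:
    print([round(v, 6) for v in row])
```

In the first document every term has the same IDF, so after normalization "brazil" (count 2) and "love" (count 1) become 2/sqrt(5) and 1/sqrt(5), which matches the 0.894 and 0.447 shown above.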

In [46]:
# Show feature names (get_feature_names() was removed in scikit-learn 1.2)
list(tfidf.get_feature_names_out())

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']

### View Feature Matrix As Data Frame


In [47]:
# Create data frame
pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names_out())

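|   | beats | best | both | brazil | germany | is | love | sweden |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.894427 | 0.0 | 0.0 | 0.447214 | 0.0 |
| 1 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.57735 |
| 2 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 |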


## Implementation of TF-IDF from scratch in Python


In [67]:
docA = "cat sat on my face"
docB = "The The dog sat on my bed"

In [68]:
bowA = docA.split(" ")
bowB = docB.split(" ")

In [69]:
bowB

['The', 'The', 'dog', 'sat', 'on', 'my', 'bed']

In [70]:
wordSet = set(bowA).union(set(bowB))

In [71]:
wordSet

{'The', 'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat'}

In [72]:
wordDictA = dict.fromkeys(wordSet, 0) 
wordDictB = dict.fromkeys(wordSet, 0)

In [73]:
wordDictA

{'The': 0, 'face': 0, 'sat': 0, 'dog': 0, 'my': 0, 'cat': 0, 'on': 0, 'bed': 0}

In [74]:
for word in bowA:
    wordDictA[word]+=1
    
for word in bowB:
    wordDictB[word]+=1

In [75]:
wordDictA

{'The': 0, 'face': 1, 'sat': 1, 'dog': 0, 'my': 1, 'cat': 1, 'on': 1, 'bed': 0}

In [76]:
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

|   | The | bed | cat | dog | face | my | on | sat |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |


In [77]:
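def computeTF(wordDict, bow):
    """Return each word's count divided by the number of words in the document."""
    tfDict = {}
    bowCount = len(bow)  # document length in words
    for word, count in wordDict.items():
        tfDict[word] = count / bowCount
    return tfDict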

In [78]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

In [79]:
tfBowA

{'The': 0.0,
 'face': 0.2,
 'sat': 0.2,
 'dog': 0.0,
 'my': 0.2,
 'cat': 0.2,
 'on': 0.2,
 'bed': 0.0}

In [80]:
tfBowB

{'The': 0.2857142857142857,
 'face': 0.0,
 'sat': 0.14285714285714285,
 'dog': 0.14285714285714285,
 'my': 0.14285714285714285,
 'cat': 0.0,
 'on': 0.14285714285714285,
 'bed': 0.14285714285714285}

In [81]:
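import math

def computeIDF(docList):
    """Return log10(N / document frequency) for every word in the vocabulary."""
    N = len(docList)

    # Count the number of documents that contain each word
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # Convert document counts to IDF (every word occurs in at least one document)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / val)

    return idfDict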

In [82]:
idfs = computeIDF([wordDictA, wordDictB])

In [83]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [84]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [87]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

|   | The | bed | cat | dog | face | my | on | sat |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.060206 | 0.0 | 0.060206 | 0.0 | 0.0 | 0.0 |
| 1 | 0.086009 | 0.043004 | 0.0 | 0.043004 | 0.0 | 0.0 | 0.0 | 0.0 |
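The step-by-step cells above can also be collapsed into a single function that accepts any number of documents. This is a compact sketch rather than a replacement for the walkthrough: the name `tf_idf` is hypothetical, it relies on `collections.Counter`, and it assumes every document is non-empty and whitespace-separated.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document (TF * log10-based IDF)."""
    bows = [doc.split() for doc in docs]
    vocab = sorted({word for bow in bows for word in bow})
    n_docs = len(bows)

    # Document frequency: number of documents that contain each word
    df = Counter(word for bow in bows for word in set(bow))
    idf = {word: math.log10(n_docs / df[word]) for word in vocab}

    result = []
    for bow in bows:
        counts = Counter(bow)  # missing words count as 0
        result.append({w: counts[w] / len(bow) * idf[w] for w in vocab})
    return result

weights = tf_idf(["cat sat on my face", "The The dog sat on my bed"])
print(round(weights[0]["cat"], 6), round(weights[1]["The"], 6))
```

For the two example documents this yields the same weights as the table above. Words shared by both documents, such as "sat", get a weight of 0 because their IDF is log10(2 / 2) = 0.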
