# An introduction to TF-IDF
## TF-IDF stands for “Term Frequency - Inverse Document Frequency”.

### Term Frequency (tf):
    > TF gives the frequency of a word in each document in the corpus.
    > It is the ratio of the number of times the word appears in a document to the total number of words in that document.
    > The TF of a word increases as the number of occurrences of the word within the document increases.
    > Each document has its own TF, and the formula is TF(t, d) = (Number of times term t appears in document d) /
                    (Total number of terms in document d)
 <img src="TF_Formula.png"> 
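The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's own implementation: the function name `term_frequency` and the lowercased toy document are assumptions made for the example.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total_tokens} for one tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}  # an empty document has no term frequencies
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical toy document, already lowercased and split on whitespace.
doc = "the car is driven on the road".split()
tf = term_frequency(doc)

print(tf["the"])  # 'the' is 2 of the 7 tokens
print(tf["car"])  # 'car' is 1 of the 7 tokens
```

Because every token contributes to exactly one term, the TF values of a document sum to 1.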

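### Inverse Document Frequency (idf):
    > IDF is used to calculate the weight of rare words across all the documents in the corpus.
    > Words that occur in fewer documents have a higher IDF.
    > The formula for IDF is IDF(t) = log(Total number of documents / Number of documents with term t in it). It is often written with the natural logarithm (log_e); the Python code below uses base 10 (math.log10), which only rescales every score by the same constant factor.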
 <img src="IDF_Formula.png"> 
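A small sketch of the IDF formula, assuming a base-10 logarithm and documents that are already tokenized. The function name `inverse_document_frequency` and the two toy documents are illustrative assumptions.

```python
import math

def inverse_document_frequency(documents):
    """Return {term: log10(N / df)}, where df is the number of documents containing the term."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log10(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical toy corpus, lowercased and split on whitespace.
docs = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
]
idf = inverse_document_frequency(docs)

print(idf["the"])  # appears in both documents, so log10(2 / 2)
print(idf["car"])  # appears in one document, so log10(2 / 1)
```

A term that occurs in every document gets an IDF of 0, which is why very common words end up with no weight under this plain formula.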

#### Combining these two we come up with the TF-IDF score (w) for a word in a document in the corpus. It is the product of tf and idf:
<img src="combine.png">
<img src="metaOfFormula.png">


#### Let’s take an example to get a clearer understanding.
    > Sentence 1: The car is driven on the road.
    > Sentence 2: The truck is driven on the highway.
We will now calculate the TF-IDF for the above two documents, which represent our corpus.

<img src="TFIDF_Corpus.png">
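As a worked instance of the product of tf and idf, here is the score of the word "car" in Sentence 1, assuming a base-10 logarithm. Sentence 1 has 7 words, "car" occurs once in it, and "car" appears in 1 of the 2 documents:

```latex
\begin{aligned}
\mathrm{tf}(\text{car}, S_1) &= \frac{1}{7} \approx 0.142857 \\
\mathrm{idf}(\text{car}) &= \log_{10}\frac{2}{1} \approx 0.301030 \\
\text{tf-idf}(\text{car}, S_1) &= 0.142857 \times 0.301030 \approx 0.043004
\end{aligned}
```

A word such as "driven" appears in both documents, so its idf is log10(2/2) = 0 and its TF-IDF score is 0 in both sentences.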

## Let us use Python to calculate the TF-IDF

In [31]:
S1="The car is driven on the road."
S2="The truck is driven on the highway."

In [32]:
# Special characters rarely carry meaning for this analysis, so let us remove them first.
specialChars = ".?';:[]{}-+=@#!%^&*()~"
for c in specialChars:
    S1 = S1.replace(c, "")
    S2 = S2.replace(c, "")
print(S1)
print(S2)

The car is driven on the road
The truck is driven on the highway


In [33]:
bowA = S1.split(" ")
bowB = S2.split(" ")
wordSet = set(bowA).union(set(bowB))
print(bowA)
print(bowB)
print(wordSet)

['The', 'car', 'is', 'driven', 'on', 'the', 'road']
['The', 'truck', 'is', 'driven', 'on', 'the', 'highway']
{'highway', 'The', 'road', 'car', 'driven', 'the', 'truck', 'on', 'is'}
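Note that the set above contains both 'The' and 'the', because `split` is case-sensitive. One common remedy, shown here as a sketch rather than a change to the notebook's own cells, is to lowercase each sentence before splitting. The variable names below are illustrative.

```python
s1 = "The car is driven on the road"
s2 = "The truck is driven on the highway"

# Lowercase before splitting so that 'The' and 'the' map to the same token.
tokens_a = s1.lower().split()
tokens_b = s2.lower().split()
vocabulary = set(tokens_a) | set(tokens_b)

print(sorted(vocabulary))     # 8 distinct words instead of 9
print(tokens_a.count("the"))  # 'the' is now counted twice in sentence 1
```

With lowercasing, the vocabulary shrinks from 9 entries to 8 and "the" has a count of 2 per sentence, so the numbers computed in the following cells would change accordingly.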


In [34]:
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

In [35]:
wordDictA

{'highway': 0,
 'The': 0,
 'road': 0,
 'car': 0,
 'driven': 0,
 'the': 0,
 'truck': 0,
 'on': 0,
 'is': 0}

In [36]:
for word in bowA:
    wordDictA[word]+=1
    
for word in bowB:
    wordDictB[word]+=1

In [37]:
wordDictA

{'highway': 0,
 'The': 1,
 'road': 1,
 'car': 1,
 'driven': 1,
 'the': 1,
 'truck': 0,
 'on': 1,
 'is': 1}
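The same per-document counts can also be built with `collections.Counter` from the standard library. This is an alternative sketch, not the approach the notebook uses; the helper name `count_vector` is an assumption.

```python
from collections import Counter

bow_a = ['The', 'car', 'is', 'driven', 'on', 'the', 'road']
bow_b = ['The', 'truck', 'is', 'driven', 'on', 'the', 'highway']
word_set = set(bow_a) | set(bow_b)

def count_vector(tokens, vocabulary):
    """Count each vocabulary word in tokens; words absent from tokens get 0."""
    counts = Counter(tokens)
    # Indexing a Counter with a missing key returns 0 instead of raising KeyError.
    return {word: counts[word] for word in vocabulary}

counts_a = count_vector(bow_a, word_set)
counts_b = count_vector(bow_b, word_set)
print(counts_a)
```

The resulting dictionaries hold the same values as `wordDictA` and `wordDictB`, with a zero for every vocabulary word the sentence does not contain.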

In [38]:
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

| | The | car | driven | highway | is | on | road | the | truck |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |


In [39]:
def computeTF(wordDict, bow):
    """The function computeTF computes the TF score for each word in the corpus, by document."""
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count/float(bowCount)
    return tfDict

In [40]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

In [41]:
tfBowA

{'highway': 0.0,
 'The': 0.14285714285714285,
 'road': 0.14285714285714285,
 'car': 0.14285714285714285,
 'driven': 0.14285714285714285,
 'the': 0.14285714285714285,
 'truck': 0.0,
 'on': 0.14285714285714285,
 'is': 0.14285714285714285}

In [42]:
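import math

def computeIDF(docList):
    """Compute the base-10 IDF of every word from a list of per-document word-count dicts."""
    N = len(docList)

    # Document frequency: the number of documents in which each word occurs at least once.
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # Every word in the shared vocabulary occurs in at least one document, so val is never 0 here.
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict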

In [43]:
idfs = computeIDF([wordDictA, wordDictB])
idfs

{'highway': 0.3010299956639812,
 'The': 0.0,
 'road': 0.3010299956639812,
 'car': 0.3010299956639812,
 'driven': 0.0,
 'the': 0.0,
 'truck': 0.3010299956639812,
 'on': 0.0,
 'is': 0.0}

In [44]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [45]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [46]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

| | The | car | driven | highway | is | on | road | the | truck |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.043004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043004 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.043004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043004 |


## Let us do it the scikit-learn way

In [47]:
corpus = [S1,S2]
corpus

['The car is driven on the road', 'The truck is driven on the highway']

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In [49]:
X.shape

(2, 8)

In [50]:
vectorizer.vocabulary_

{'the': 6,
 'car': 0,
 'is': 3,
 'driven': 1,
 'on': 4,
 'road': 5,
 'truck': 7,
 'highway': 2}

In [51]:
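# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
vectorizer.get_feature_names_out()

array(['car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck'],
      dtype=object)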

In [52]:
X.data

array([0.60437955, 0.42471719, 0.30218978, 0.30218978, 0.30218978,
       0.42471719, 0.60437955, 0.30218978, 0.30218978, 0.30218978,
       0.42471719, 0.42471719])

In [56]:
X.toarray()

array([[0.42471719, 0.30218978, 0.        , 0.30218978, 0.30218978,
        0.42471719, 0.60437955, 0.        ],
       [0.        , 0.30218978, 0.42471719, 0.30218978, 0.30218978,
        0.        , 0.60437955, 0.42471719]])

In [57]:
pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out())

| | car | driven | highway | is | on | road | the | truck |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.424717 | 0.30219 | 0.0 | 0.30219 | 0.30219 | 0.424717 | 0.60438 | 0.0 |
| 1 | 0.0 | 0.30219 | 0.424717 | 0.30219 | 0.30219 | 0.0 | 0.60438 | 0.424717 |


In [58]:
# And, for comparison, the result of our manual calculation:
pd.DataFrame([tfidfBowA, tfidfBowB])

| | The | car | driven | highway | is | on | road | the | truck |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.043004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043004 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.043004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043004 |
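The two tables differ because `TfidfVectorizer` does not use the plain formula by default. It lowercases the text, uses raw counts as the term frequency, smooths the IDF as ln((1 + N) / (1 + df)) + 1, and scales each row to unit L2 length. The sketch below reproduces those defaults in pure Python for this corpus. The function name `sklearn_style_tfidf` is hypothetical, and whitespace splitting stands in for scikit-learn's tokenizer, which gives the same tokens here only because every word has at least two characters and the punctuation was already removed.

```python
import math
from collections import Counter

corpus = [
    "The car is driven on the road",
    "The truck is driven on the highway",
]

def sklearn_style_tfidf(corpus):
    """Approximate TfidfVectorizer defaults: lowercase, raw counts, smoothed IDF, L2 norm."""
    docs = [text.lower().split() for text in corpus]
    n_docs = len(docs)
    vocabulary = sorted({term for doc in docs for term in doc})

    # Smoothed IDF with the natural logarithm: ln((1 + N) / (1 + df)) + 1
    doc_freq = {t: sum(t in doc for doc in docs) for t in vocabulary}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))  # L2 norm of the row
        rows.append([w / norm for w in weights])
    return vocabulary, rows

vocabulary, rows = sklearn_style_tfidf(corpus)
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term:8s} {weight:.6f}")
```

Under these assumptions the first row matches the scikit-learn output above: "the" gets the largest weight because it occurs twice and the smoothed IDF never drops to zero, whereas the manual calculation gives every word shared by both sentences a score of 0.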
