* https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76
* https://en.wikipedia.org/wiki/Tf%E2%80%93idf

tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. 

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in a bucket. This approach is called a bag of words model, or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost.
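To make the loss of structure concrete, here is a minimal sketch showing that two sentences with different meanings can end up as the same bag of words. The two example sentences and the variable names are illustrative assumptions, not part of the original notebook.

```python
# Two hypothetical sentences that use the same words in a different order.
sentence1 = 'the dog bit the man'
sentence2 = 'the man bit the dog'

# Sorting the tokens discards word order, which is all a bag of words keeps.
bag1 = sorted(sentence1.split(' '))
bag2 = sorted(sentence2.split(' '))

same_bag = bag1 == bag2
print(same_bag)  # True
```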

In [3]:
bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')

In [4]:
bagOfWordsA

['the', 'man', 'went', 'out', 'for', 'a', 'walk']

By casting the bag of words to a set, we can automatically remove any duplicate words.

In [5]:
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

uniqueWords

{'a',
 'around',
 'children',
 'fire',
 'for',
 'man',
 'out',
 'sat',
 'the',
 'walk',
 'went'}

Next, we’ll create a dictionary that maps each word to its number of occurrences, with one dictionary per document in the corpus (collection of documents).

In [6]:
numOfWordsA = dict.fromkeys(uniqueWords, 0)

for word in bagOfWordsA:
    numOfWordsA[word] += 1
    
numOfWordsB = dict.fromkeys(uniqueWords, 0)

for word in bagOfWordsB:
    numOfWordsB[word] += 1

In [7]:
numOfWordsB

{'the': 2,
 'sat': 1,
 'for': 0,
 'around': 1,
 'children': 1,
 'out': 0,
 'a': 0,
 'man': 0,
 'fire': 1,
 'went': 0,
 'walk': 0}
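As an alternative sketch, the same counts can be produced with `collections.Counter` from the standard library. The name `numOfWordsB_alt` is introduced here only for illustration.

```python
from collections import Counter

documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')
uniqueWords = set(bagOfWordsA) | set(bagOfWordsB)

# Counter returns 0 for missing keys, so indexing by every vocabulary word
# yields the same zero-filled dictionary as the explicit loop.
countsB = Counter(bagOfWordsB)
numOfWordsB_alt = {word: countsB[word] for word in uniqueWords}
```

Whether this reads better than the explicit loop is a matter of taste; the loop makes the zero-filling step more visible.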

In natural language processing, very common words that carry little meaning on their own (such as "the" or "a") are referred to as stop words. The Python Natural Language Toolkit (NLTK) library provides a list of English stop words.

In [9]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/robin/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
from nltk.corpus import stopwords

stopwords.words('english')[-10:]

['shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]

Often, when building a model with the goal of understanding text, you’ll see all of the stop words being removed. Another strategy is to score the relative importance of words using TF-IDF.
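As a minimal sketch of the removal strategy, the snippet below filters one of our documents against a stop-word set. The set `example_stop_words` is a tiny hand-written assumption so the example runs without downloading anything; in practice you would likely use the NLTK list shown earlier.

```python
# A tiny hand-written stop-word set used only for illustration;
# in practice you might use nltk's stopwords.words('english') instead.
example_stop_words = {'the', 'a', 'for', 'out'}

bagOfWordsA = 'the man went out for a walk'.split(' ')

# Keep only the words that are not in the stop-word set.
filteredA = [word for word in bagOfWordsA if word not in example_stop_words]
print(filteredA)  # ['man', 'went', 'walk']
```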

**Term Frequency (TF)** = The number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequency.

In [12]:
def computeTF(wordDict, bagOfWords):
    """Calc term frequency."""
    tfDict = {}
    for word, count in wordDict.items():
        tfDict[word] = count / float(len(bagOfWords))
    return tfDict

The following lines compute the term frequency for each of our documents.

In [13]:
tfA = computeTF(numOfWordsA, bagOfWordsA)
tfB = computeTF(numOfWordsB, bagOfWordsB)

In [14]:
tfA

{'the': 0.14285714285714285,
 'sat': 0.0,
 'for': 0.14285714285714285,
 'around': 0.0,
 'children': 0.0,
 'out': 0.14285714285714285,
 'a': 0.14285714285714285,
 'man': 0.14285714285714285,
 'fire': 0.0,
 'went': 0.14285714285714285,
 'walk': 0.14285714285714285}

**Inverse Document Frequency (IDF)** = The log of the number of documents divided by the number of documents that contain the word w. Inverse document frequency gives rare words a higher weight and words that appear in most documents a lower weight.

In [15]:
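import math

def computeIDF(documents):
    """Calc inverse document frequency for every word in the vocabulary."""
    N = len(documents)

    # Count how many documents contain each word.
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1

    # Convert each document count into log(N / count).
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict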

The IDF is computed once for all documents.

In [16]:
idfs = computeIDF([numOfWordsA, numOfWordsB])
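To see what these weights look like, here is a small worked sketch for two words from our corpus, computed directly with the natural log. The variable names `idf_the` and `idf_man` are illustrative.

```python
import math

N = 2  # number of documents in the corpus

# Document frequencies for two words from the example corpus:
# 'the' appears in both documents, 'man' only in documentA.
idf_the = math.log(N / 2)
idf_man = math.log(N / 1)

print(idf_the)            # 0.0
print(round(idf_man, 6))  # 0.693147
```

A word that occurs in every document gets an IDF of zero, so with this unsmoothed formula it contributes nothing to the final score.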

Lastly, the TF-IDF is simply the TF multiplied by IDF.
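In symbols, the definitions used in this notebook can be summarized as follows. The notation is an assumption introduced here: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, $n_t$ is the number of documents containing $t$, and $\log$ is the natural logarithm.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

For example, "man" occurs once among the seven words of documentA and in one of the two documents, so its score is $\frac{1}{7} \cdot \log 2 \approx 0.099021$.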

In [17]:
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

Finally, we can compute the TF-IDF scores for all the words in the corpus.

In [18]:
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)

df = pd.DataFrame([tfidfA, tfidfB])

In [19]:
df

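| | a | around | children | fire | for | man | out | sat | the | walk | went |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.099021 | 0.0 | 0.0 | 0.0 | 0.099021 | 0.099021 | 0.099021 | 0.0 | 0.0 | 0.099021 | 0.099021 |
| 1 | 0.0 | 0.115525 | 0.115525 | 0.115525 | 0.0 | 0.0 | 0.0 | 0.115525 | 0.0 | 0.0 | 0.0 |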


Rather than manually implementing TF-IDF ourselves, we could use the class provided by sklearn.

In [20]:
vectorizer = TfidfVectorizer()

vectors = vectorizer.fit_transform([documentA, documentB])

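# Column labels, one per vocabulary term (requires scikit-learn >= 1.0)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense array for display
dense = vectors.toarray()

df = pd.DataFrame(dense, columns=feature_names)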

df

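| | around | children | fire | for | man | out | sat | the | walk | went |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.42616 | 0.42616 | 0.42616 | 0.0 | 0.303216 | 0.42616 | 0.42616 |
| 1 | 0.407401 | 0.407401 | 0.407401 | 0.0 | 0.0 | 0.0 | 0.407401 | 0.579739 | 0.0 | 0.0 |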


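The values differ from our manual calculation because sklearn, by default, uses a smoothed version of idf (it adds 1 to both document counts inside the logarithm and 1 to the result), uses raw term counts rather than length-normalized frequencies, and scales each document vector to unit L2 norm. Its default tokenizer also drops single-character tokens, which is why the word a is missing from the columns. Because of the added 1, the word the no longer scores zero; in an example with more text, the score for the word the would be greatly reduced.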
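As a sketch of where the sklearn numbers come from, the snippet below recomputes them in plain Python. It assumes the vectorizer's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`, lowercasing, and the default token pattern that keeps tokens of two or more word characters); the helper names are illustrative, not part of sklearn.

```python
import math
import re

docs = ['the man went out for a walk', 'the children sat around the fire']

# Tokens of two or more word characters, mirroring TfidfVectorizer's
# default token_pattern; this is why 'a' is missing from the table above.
tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
vocab = sorted(set(word for tokens in tokenized for word in tokens))
n_docs = len(docs)

# Smoothed idf: ln((1 + N) / (1 + df)) + 1
idf = {}
for word in vocab:
    df = sum(1 for tokens in tokenized if word in tokens)
    idf[word] = math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for tokens in tokenized:
    # Raw term counts (not divided by document length) times idf.
    weights = {word: tokens.count(word) * idf[word] for word in vocab}
    # L2-normalise each document vector.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    rows.append({word: w / norm for word, w in weights.items()})

print(round(rows[0]['the'], 6), round(rows[1]['the'], 6))
```

Under these assumptions the printed scores for the word the should agree with the 0.303216 and 0.579739 shown in the table.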