### TF-IDF Model using scikit-learn, NLTK and Python

In the bag-of-words (BOW) model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.

The BOW model only records which known words occur in a document and how often. It does not capture meaning, context, or the order in which the words appear.

The bag-of-words model is commonly used in document classification, where the frequency of occurrence of each word is used as a feature for training a classifier.

Its main disadvantage is that raw counts do not tell us which words are more important: a frequent but uninformative word is scored the same way as a rare, distinctive one. TF-IDF, covered in this notebook, addresses this by weighting each word.
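As a minimal sketch of "keeping multiplicity", the snippet below counts words in a single made-up sentence using only the standard library. The sentence and variable names are illustrative assumptions and are not part of this notebook's data.

```python
from collections import Counter

# Hypothetical example sentence (not from the notebook's data)
sentence = "the cat sat on the mat"

# A bag of words is a mapping from each word to how often it occurs
bag = Counter(sentence.split())

# 'the' occurs twice, 'cat' once, and an unseen word has count 0
print(bag["the"], bag["cat"], bag["dog"])  # 2 1 0
```

Nothing in `bag` records where a word appeared, only how many times.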

In [1]:
import numpy as np
import pandas as pd
import wikipedia as wp
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer,PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
# Let's import the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Let's verify TF and IDF manually

In [2]:
docs = ['fred have never been to boston',
        'boston is in america',
        'paris is the capitol city of france',
        'this sentence has no named entities included',
        'i have been to san francisco and paris']

In [3]:
#docs = test_sentences_clean

In [4]:
len(docs)

5

In [5]:
docs

['fred have never been to boston',
 'boston is in america',
 'paris is the capitol city of france',
 'this sentence has no named entities included',
 'i have been to san francisco and paris']

#### Step 1. Get TF-IDF scores for the boston token

In [6]:
tftdf_l1 = TfidfVectorizer(encoding='utf-8',
    decode_error='strict',
    strip_accents=None,
    lowercase=True,
    preprocessor=None,
    tokenizer=None,
    analyzer='word',
    stop_words=None,
    token_pattern='(?u)\\b\\w\\w+\\b',
    ngram_range=(1, 1),
    max_df=1.0,
    min_df=1,
    max_features=None,
    vocabulary=None,
    binary=False,
    norm='l1', # 'l1' or 'l2' 
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=False)
# (default settings have smooth_idf=True that adds “1” to the numerator and denominator as if an extra document was seen 
# containing every term in the collection exactly once, which prevents zero divisions).

In [7]:
tfidf_doc_l1 = tftdf_l1.fit_transform(docs).todense()

In [8]:
features_tfidf_l1 = tftdf_l1.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0
features_tfidf_l1

['america',
 'and',
 'been',
 'boston',
 'capitol',
 'city',
 'entities',
 'france',
 'francisco',
 'fred',
 'has',
 'have',
 'in',
 'included',
 'is',
 'named',
 'never',
 'no',
 'of',
 'paris',
 'san',
 'sentence',
 'the',
 'this',
 'to']

In [9]:
len(features_tfidf_l1)

25

In [10]:
tftdf_l1.vocabulary_

{'fred': 9,
 'have': 11,
 'never': 16,
 'been': 2,
 'to': 24,
 'boston': 3,
 'is': 14,
 'in': 12,
 'america': 0,
 'paris': 19,
 'the': 22,
 'capitol': 4,
 'city': 5,
 'of': 18,
 'france': 7,
 'this': 23,
 'sentence': 21,
 'has': 10,
 'no': 17,
 'named': 15,
 'entities': 6,
 'included': 13,
 'san': 20,
 'francisco': 8,
 'and': 1}

In [11]:
# Create the dataframe
df_doc_l1 = pd.DataFrame(tfidf_doc_l1,columns=features_tfidf_l1)
df_doc_l1

Unnamed: 0,america,and,been,boston,capitol,city,entities,france,francisco,fred,...,named,never,no,of,paris,san,sentence,the,this,to
0,0.0,0.0,0.154346,0.154346,0.0,0.0,0.0,0.0,0.0,0.191308,...,0.0,0.191308,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.154346
1,0.276733,0.0,0.0,0.223267,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.151204,0.151204,0.0,0.151204,0.0,0.0,...,0.0,0.0,0.0,0.151204,0.12199,0.0,0.0,0.151204,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,...,0.142857,0.0,0.142857,0.0,0.0,0.0,0.142857,0.0,0.142857,0.0
4,0.0,0.160586,0.12956,0.0,0.0,0.0,0.0,0.0,0.160586,0.0,...,0.0,0.0,0.0,0.0,0.12956,0.160586,0.0,0.0,0.0,0.12956


In [12]:
#TFIDF for "boston"
df_doc_l1.loc[:,'boston']

0    0.154346
1    0.223267
2    0.000000
3    0.000000
4    0.000000
Name: boston, dtype: float64

#### Step 2. Calculate TF-IDF for the boston token without normalization

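**TF(t)** = Number of times term t appears in the document / Total number of terms in the document

**IDF(t)** = log((Total number of documents + 1) / (Number of documents containing term t + 1)) + 1, using the natural logarithm

Note that `TfidfVectorizer` itself uses the raw count of t as TF. Dividing by the document length, as done here, only rescales every term in a document by the same constant, so it cancels out once the row is normalized in Step 3.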
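The smoothed IDF can be checked in a few lines of plain Python. This is only a sketch: the helper name `smoothed_idf` is made up for illustration, and splitting on whitespace is a simplification of the vectorizer's tokenizer (it makes no difference for the two terms used here).

```python
import math

docs = ['fred have never been to boston',
        'boston is in america',
        'paris is the capitol city of france',
        'this sentence has no named entities included',
        'i have been to san francisco and paris']

def smoothed_idf(term, documents):
    """Return log((1 + N) / (1 + df)) + 1, where df counts documents containing term."""
    n_docs = len(documents)
    doc_freq = sum(term in doc.split() for doc in documents)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# 'boston' occurs in 2 documents, 'fred' in 1
for term in ("boston", "fred"):
    print(term, round(smoothed_idf(term, docs), 4))
# boston 1.6931
# fred 2.0986
```

The rarer term gets the larger IDF, which is the whole point of the weighting.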

In [13]:
# 0th document
docs[0]

'fred have never been to boston'

In [14]:
# Total docs
len(docs)

5

In [15]:
# Compute the unnormalized tfidf of 'boston' in the 0th document: tf = 1/6, 'boston' occurs in 2 of 5 documents
# Note: single-character words such as 'i' are not counted as tokens by the default token_pattern.
tfidf_boston_wo_norm_0 = (1/6) * (np.log((1+5)/(1+2))+1)
tfidf_boston_wo_norm_0

0.2821911967599909

In [16]:
# 1st document
docs[1]

'boston is in america'

In [17]:
# Compute the unnormalized tfidf of 'boston' in the 1st document: tf = 1/4, 'boston' occurs in 2 of 5 documents
tfidf_boston_wo_norm_1 = (1/4) * (np.log((1+5)/(1+2))+1)
tfidf_boston_wo_norm_1

0.42328679513998635
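The note about `i` not counting as a token can be checked directly with the `token_pattern` shown in the constructor call above. This sketch applies that regular expression with the standard `re` module; it imitates only the tokenization step and is not the vectorizer itself.

```python
import re

# Default token_pattern from the TfidfVectorizer call above:
# a token is a run of two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

sentence = 'i have been to san francisco and paris'
tokens = token_pattern.findall(sentence.lower())

print(tokens)
# ['have', 'been', 'to', 'san', 'francisco', 'and', 'paris']
print(len(sentence.split()), len(tokens))  # 8 7
```

So the last document contributes 7 tokens, not 8, to any manual count.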

#### Step 3. Normalization

In [18]:
docs

['fred have never been to boston',
 'boston is in america',
 'paris is the capitol city of france',
 'this sentence has no named entities included',
 'i have been to san francisco and paris']

In [19]:
# Let's calculate the l1 norm first: the sum of the unnormalized tf*idf values of all the words in a document (row).
# After dividing by this sum, the tfidf values of each row add up to 1:

l1_norm_0 = ((1/6) * (np.log((1+5)/(1+1))+1) +
         (1/6) * (np.log((1+5)/(1+2))+1) +
         (1/6) * (np.log((1+5)/(1+1))+1) +
         (1/6) * (np.log((1+5)/(1+2))+1) +
         (1/6) * (np.log((1+5)/(1+2))+1)+
         (1/6) * (np.log((1+5)/(1+2))+1))

l1_norm_1 = ((1/4) * (np.log((1+5)/(1+2))+1) +
         (1/4) * (np.log((1+5)/(1+2))+1) +
         (1/4) * (np.log((1+5)/(1+1))+1) +
         (1/4) * (np.log((1+5)/(1+1))+1))

tfidf_boston_w_l1_norm_0 = tfidf_boston_wo_norm_0/l1_norm_0
             
tfidf_boston_w_l1_norm_1 = tfidf_boston_wo_norm_1/l1_norm_1
   
# TFIDF for "boston"
print("'boston' tfidf for l1 \n",np.round(df_doc_l1.loc[:,'boston'],4))
             
print("tfidf_boston_w_l1_norm_0 : ",np.round(tfidf_boston_w_l1_norm_0,4)) 
print("tfidf_boston_w_l1_norm_1 : ",np.round(tfidf_boston_w_l1_norm_1,4)) 

'boston' tfidf for l1 
 0    0.1543
1    0.2233
2    0.0000
3    0.0000
4    0.0000
Name: boston, dtype: float64
tfidf_boston_w_l1_norm_0 :  0.1543
tfidf_boston_w_l1_norm_1 :  0.2233


We get the same tfidf scores as above.
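The hand-written sums can also be generalized to a whole row. The sketch below is a plain-Python approximation of the l1 pipeline, not scikit-learn's implementation; the helper names `idf` and `tfidf_l1` are assumptions. It uses raw counts instead of the 1/6 and 1/4 factors, since a constant factor per row cancels in the normalization.

```python
import math
import re

docs = ['fred have never been to boston',
        'boston is in america',
        'paris is the capitol city of france',
        'this sentence has no named entities included',
        'i have been to san francisco and paris']

token_re = re.compile(r"(?u)\b\w\w+\b")  # same pattern as the vectorizer's default
tokenized = [token_re.findall(d.lower()) for d in docs]
n_docs = len(tokenized)

def idf(term):
    # Smoothed idf: log((1 + N) / (1 + df)) + 1
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_l1(tokens):
    # Raw count times idf for every distinct term, then divide by the row sum
    raw = {t: tokens.count(t) * idf(t) for t in set(tokens)}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

for i in (0, 1):
    row = tfidf_l1(tokenized[i])
    print(i, round(row["boston"], 4), round(sum(row.values()), 4))
# 0 0.1543 1.0
# 1 0.2233 1.0
```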

In [20]:
docs

['fred have never been to boston',
 'boston is in america',
 'paris is the capitol city of france',
 'this sentence has no named entities included',
 'i have been to san francisco and paris']

In [21]:
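# Let's now do the same math for the l2 norm.
# sublinear_tf=True replaces tf with 1 + log(tf); every term here occurs once per document, so it does not change these numbers.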

tfidf_l2 = TfidfVectorizer(sublinear_tf=True,norm='l2')
tfidf_doc_l2 = tfidf_l2.fit_transform(docs).todense()


features_tfidf_l2 = tfidf_l2.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0
features_tfidf_l2

# Create the dataframe
df_doc_l2 = pd.DataFrame(tfidf_doc_l2,columns=features_tfidf_l2)
df_doc_l2

#print(df_doc_l2)

# TFIDF for "boston"
print("'boston' tfidf for l2 \n",np.round(df_doc_l2.loc[:,'boston'],4))

l2_norm_0 = np.sqrt(((1/6) * (np.log((1+5)/(1+1))+1))**2 +
         ((1/6) * (np.log((1+5)/(1+2))+1))**2 +
         ((1/6) * (np.log((1+5)/(1+1))+1))**2 +
         ((1/6) * (np.log((1+5)/(1+2))+1))**2 +
         ((1/6) * (np.log((1+5)/(1+2))+1))**2+
         ((1/6) * (np.log((1+5)/(1+2))+1))**2)
                    
l2_norm_1 = np.sqrt(((1/4) * (np.log((1+5)/(1+2))+1))**2 +
         ((1/4) * (np.log((1+5)/(1+2))+1))**2 +
         ((1/4) * (np.log((1+5)/(1+1))+1))**2 +
         ((1/4) * (np.log((1+5)/(1+1))+1))**2)


tfidf_boston_w_l2_norm_0 = tfidf_boston_wo_norm_0/l2_norm_0
             
tfidf_boston_w_l2_norm_1 = tfidf_boston_wo_norm_1/l2_norm_1
             
             
print("tfidf_boston_w_l2_norm_0 : ",np.round(tfidf_boston_w_l2_norm_0,4)) 
print("tfidf_boston_w_l2_norm_1 : ",np.round(tfidf_boston_w_l2_norm_1,4)) 

'boston' tfidf for l2 
 0    0.376
1    0.444
2    0.000
3    0.000
4    0.000
Name: boston, dtype: float64
tfidf_boston_w_l2_norm_0 :  0.376
tfidf_boston_w_l2_norm_1 :  0.444
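The vectorizer in this cell was created with `sublinear_tf=True`, yet the manual math used plain counts. The sketch below illustrates why that still matches: sublinear scaling replaces a count tf with 1 + ln(tf), which is exactly 1 when a term occurs once. The helper `sublinear_tf` and the hard-coded document frequencies for document 0 are written out here for illustration only.

```python
import math

def sublinear_tf(count):
    """Sublinear scaling: replace a raw count tf with 1 + ln(tf); 0 stays 0."""
    return 1 + math.log(count) if count > 0 else 0.0

for count in (1, 2, 10):
    print(count, round(sublinear_tf(count), 4))
# 1 1.0
# 2 1.6931
# 10 3.3026

# Document 0: 'fred have never been to boston'; every term occurs once.
# Values are the number of documents (out of 5) that contain each term.
doc_freqs = {"fred": 1, "have": 2, "never": 1, "been": 2, "to": 2, "boston": 2}
n_docs = 5

weights = {t: sublinear_tf(1) * (math.log((1 + n_docs) / (1 + df)) + 1)
           for t, df in doc_freqs.items()}
l2_norm = math.sqrt(sum(w * w for w in weights.values()))
print(round(weights["boston"] / l2_norm, 4))  # 0.376
```

With repeated terms in a document the two settings would diverge, so the manual formula would need the 1 + ln(tf) factor.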


#### Limitations

The main limitation of TF-IDF is that it ignores word order, which is an important part of understanding the meaning of a sentence.
Document length can also introduce a lot of variance in the TF-IDF values.
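To make the word-order limitation concrete, here is a small sketch with two made-up sentences (assumptions, not notebook data) that mean opposite things but contain the same words. Since TF-IDF is computed from per-document word counts, identical counts give identical vectors.

```python
from collections import Counter

# Hypothetical sentences: same words, opposite meaning
s1 = "bill gave the football to fred"
s2 = "fred gave the football to bill"

counts_1 = Counter(s1.split())
counts_2 = Counter(s2.split())

# The counts are identical, so any count-based weighting cannot tell them apart
print(counts_1 == counts_2)  # True
```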

### Let's use another dataset: first calculate the BOW, then the TF-IDF

In [22]:
# Let's create a small paragraph for testing
test_paragraph = "Bill travelled to the 'office' by his car from his house to play the football. Bill reached the office and want to play the football but picked up the 'football' there and left the office. Bill again went to the his friend's house. His friend's name is Fred. After reaching to his friend's house , he gave the football to Fred."

In [23]:
type(test_paragraph)

str

In [24]:
# Can we answer the question "What did Bill give to Fred?" using the BOW model?
# The expected answer is "football".
# Let's apply the model.

In [25]:
# Let's first split the paragraph into sentences
test_sentences = nltk.sent_tokenize(test_paragraph)

In [26]:
test_sentences

["Bill travelled to the 'office' by his car from his house to play the football.",
 "Bill reached the office and want to play the football but picked up the 'football' there and left the office.",
 "Bill again went to the his friend's house.",
 "His friend's name is Fred.",
 "After reaching to his friend's house , he gave the football to Fred."]

In [27]:
len(test_sentences)

5

In [28]:
# Clean the sentences one by one:
# strip non-alphanumeric characters, lowercase, and drop stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
test_sentences_clean = []
for sentence in test_sentences:
    alnum_only = re.sub("[^0-9a-zA-Z]+", ' ', sentence)  # replace every run of non-alphanumeric characters with a space
    tokens = alnum_only.lower().split()  # lowercase, then split the sentence into words
    tokens = [w for w in tokens if w not in stop_words]  # remove stopwords
    # tokens = list(set(tokens))  # optional: drop duplicate words (disabled to keep the counts)
    test_sentences_clean.append(' '.join(tokens))

In [29]:
# Check the distinct words in the paragraph after cleaning
# Build vocabulary
def get_clean_words(paragraph):
    stop_words = set(stopwords.words('english'))  # build the set once per call
    words = re.sub("[^0-9a-zA-Z]+", ' ', paragraph).lower().split()
    return list({w for w in words if w not in stop_words})

In [30]:
# Vocabulary size
# (the function has its own name, so the result below does not overwrite it)
test_words_clean = get_clean_words(test_paragraph)
len(test_words_clean)

17

In [31]:
test_words_clean

['fred',
 'football',
 'travelled',
 'want',
 'picked',
 'went',
 'bill',
 'office',
 'play',
 'reached',
 'friend',
 'name',
 'gave',
 'house',
 'left',
 'car',
 'reaching']

In [32]:
test_sentences_clean

['bill travelled office car house play football',
 'bill reached office want play football picked football left office',
 'bill went friend house',
 'friend name fred',
 'reaching friend house gave football fred']

In [33]:
# Let's apply the BOW model
# Define the model for the original sentences
# (stop_words is passed as a list; recent scikit-learn versions do not accept a set here)
bow_ori = CountVectorizer(stop_words=stopwords.words('english'))
# Define the model for the cleaned sentences
bow_clean = CountVectorizer()

In [34]:
# Now fit BOW on the original sentences
# Convert the original sentences into a matrix of token counts
X_ori = bow_ori.fit_transform(test_sentences).toarray()

In [35]:
# Now fit BOW on the cleaned sentences
# Convert the cleaned sentences into a matrix of token counts
X_clean = bow_clean.fit_transform(test_sentences_clean).toarray()

In [36]:
# Let's check the shape of the matrices
print("Shape of Matrix with Original Sentences",X_ori.shape)
print("Shape of Matrix with Cleaned Sentences",X_clean.shape)

Shape of Matrix with Original Sentences (5, 17)
Shape of Matrix with Cleaned Sentences (5, 17)


In [37]:
print(X_ori)

[[1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0]
 [1 0 2 0 0 0 0 1 0 2 1 1 1 0 0 1 0]
 [1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0]]


In [38]:
print(X_clean)

[[1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0]
 [1 0 2 0 0 0 0 1 0 2 1 1 1 0 0 1 0]
 [1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0]]


In [39]:
# Let's check the features in the matrix created from the original sentences
features_ori = bow_ori.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0
features_ori

['bill',
 'car',
 'football',
 'fred',
 'friend',
 'gave',
 'house',
 'left',
 'name',
 'office',
 'picked',
 'play',
 'reached',
 'reaching',
 'travelled',
 'want',
 'went']

In [40]:
# Let's check the features in the matrix created from the cleaned sentences
features_clean = bow_clean.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0
features_clean

['bill',
 'car',
 'football',
 'fred',
 'friend',
 'gave',
 'house',
 'left',
 'name',
 'office',
 'picked',
 'play',
 'reached',
 'reaching',
 'travelled',
 'want',
 'went']

In [41]:
# Create the dataframe for original sentences
df_ori = pd.DataFrame(X_ori,columns=features_ori)
df_ori

Unnamed: 0,bill,car,football,fred,friend,gave,house,left,name,office,picked,play,reached,reaching,travelled,want,went
0,1,1,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0
1,1,0,2,0,0,0,0,1,0,2,1,1,1,0,0,1,0
2,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1
3,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0


In [42]:
# Create the dataframe for cleaned sentences
df_clean = pd.DataFrame(X_clean,columns=features_clean)
df_clean

Unnamed: 0,bill,car,football,fred,friend,gave,house,left,name,office,picked,play,reached,reaching,travelled,want,went
0,1,1,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0
1,1,0,2,0,0,0,0,1,0,2,1,1,1,0,0,1,0
2,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1
3,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0


The length of each vector (X_clean.shape[1]) will always be equal to the vocabulary size (len(test_words_clean)).
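This relationship can be sanity-checked without scikit-learn. The sketch below rebuilds a count matrix from the cleaned sentences printed earlier; the names `vocabulary` and `count_matrix` are illustrative assumptions, and whitespace splitting stands in for the vectorizer's tokenizer (equivalent here because every cleaned word has at least two characters).

```python
from collections import Counter

test_sentences_clean = [
    'bill travelled office car house play football',
    'bill reached office want play football picked football left office',
    'bill went friend house',
    'friend name fred',
    'reaching friend house gave football fred',
]

# Sorted vocabulary, one column per distinct word
vocabulary = sorted({w for s in test_sentences_clean for w in s.split()})

# One row per sentence, one count per vocabulary word
counters = [Counter(s.split()) for s in test_sentences_clean]
count_matrix = [[c[w] for w in vocabulary] for c in counters]

print(len(vocabulary))  # 17
print(all(len(row) == len(vocabulary) for row in count_matrix))  # True
print(count_matrix[1])
# [1, 0, 2, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 0, 0, 1, 0]
```

The second row matches the second row of the count matrices printed above.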

In [43]:
test_sentences_clean

['bill travelled office car house play football',
 'bill reached office want play football picked football left office',
 'bill went friend house',
 'friend name fred',
 'reaching friend house gave football fred']

#### Limitations of BOW

The df_clean table above shows that our paragraph is tokenised well, but there are some disadvantages if we use it to train an ML model. Each word is represented only by its count in that sentence, so we cannot tell which words carry more weight in the paragraph.

Semantic meaning: the basic BOW approach does not consider the meaning of a word in the document. It completely ignores the context in which the word is used, even though the same word can mean different things depending on the context or nearby words.

Vector size: for a large document, the vocabulary (and therefore the vector) can be huge, resulting in a lot of computation and time. You may need to ignore words based on their relevance to your use case.

### Let's use TF-IDF to address the limitations of BOW

In [44]:
# Let's define the model for the original sentences
# (stop_words is passed as a list; recent scikit-learn versions do not accept a set here)
vectorizer_ori = TfidfVectorizer(stop_words=stopwords.words('english'))

In [45]:
# Let's define the model for the cleaned sentences
vectorizer_clean = TfidfVectorizer()

In [46]:
# Now fit TF-IDF on the original sentences
# Convert the original sentences into a TF-IDF matrix
X_ori_tfidf = vectorizer_ori.fit_transform(test_sentences).toarray()

In [47]:
# Now fit TF-IDF on the cleaned sentences
# Convert the cleaned sentences into a TF-IDF matrix
X_clean_tfidf = vectorizer_clean.fit_transform(test_sentences_clean).toarray()

In [48]:
# Let's check the shape of the matrices
print("Shape of Matrix with Original Sentences",X_ori_tfidf.shape)
print("Shape of Matrix with Cleaned Sentences",X_clean_tfidf.shape)

Shape of Matrix with Original Sentences (5, 17)
Shape of Matrix with Cleaned Sentences (5, 17)


In [49]:
# Let's check the features in the matrix created from the original sentences
features_ori_tfidf = vectorizer_ori.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0
features_ori_tfidf

['bill',
 'car',
 'football',
 'fred',
 'friend',
 'gave',
 'house',
 'left',
 'name',
 'office',
 'picked',
 'play',
 'reached',
 'reaching',
 'travelled',
 'want',
 'went']

In [50]:
# Let's check the features in the matrix created from the cleaned sentences
features_clean_tfidf = vectorizer_clean.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0
features_clean_tfidf

['bill',
 'car',
 'football',
 'fred',
 'friend',
 'gave',
 'house',
 'left',
 'name',
 'office',
 'picked',
 'play',
 'reached',
 'reaching',
 'travelled',
 'want',
 'went']

In [51]:
# Create the dataframe for original sentences
df_ori_tfidf = pd.DataFrame(X_ori_tfidf,columns=features_ori_tfidf)
df_ori_tfidf

Unnamed: 0,bill,car,football,fred,friend,gave,house,left,name,office,picked,play,reached,reaching,travelled,want,went
0,0.310659,0.46387,0.310659,0.0,0.0,0.0,0.310659,0.0,0.0,0.374247,0.0,0.374247,0.0,0.0,0.46387,0.0,0.0
1,0.217316,0.0,0.434632,0.0,0.0,0.0,0.0,0.324492,0.0,0.523595,0.324492,0.261798,0.324492,0.0,0.0,0.324492,0.0
2,0.437287,0.0,0.0,0.0,0.437287,0.0,0.437287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.652948
3,0.0,0.0,0.0,0.556816,0.462208,0.0,0.0,0.0,0.690159,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.335004,0.403576,0.335004,0.500222,0.335004,0.0,0.0,0.0,0.0,0.0,0.0,0.500222,0.0,0.0,0.0


In [52]:
# Create the dataframe for clean sentences
df_clean_tfidf = pd.DataFrame(X_clean_tfidf,columns=features_clean_tfidf)
df_clean_tfidf

Unnamed: 0,bill,car,football,fred,friend,gave,house,left,name,office,picked,play,reached,reaching,travelled,want,went
0,0.310659,0.46387,0.310659,0.0,0.0,0.0,0.310659,0.0,0.0,0.374247,0.0,0.374247,0.0,0.0,0.46387,0.0,0.0
1,0.217316,0.0,0.434632,0.0,0.0,0.0,0.0,0.324492,0.0,0.523595,0.324492,0.261798,0.324492,0.0,0.0,0.324492,0.0
2,0.437287,0.0,0.0,0.0,0.437287,0.0,0.437287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.652948
3,0.0,0.0,0.0,0.556816,0.462208,0.0,0.0,0.0,0.690159,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.335004,0.403576,0.335004,0.500222,0.335004,0.0,0.0,0.0,0.0,0.0,0.0,0.500222,0.0,0.0,0.0


In BOW we had a raw count for each word in a sentence, whereas here each word gets a different weight in each sentence. This is more meaningful than a plain word count.
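The second sentence gives a concrete case: "football" and "office" both occur twice, so BOW cannot separate them, while TF-IDF ranks "office" higher because it appears in fewer sentences. The sketch below recomputes those two weights in plain Python, assuming the default smoothed IDF and l2 normalization; the helper name `idf` is made up for illustration.

```python
import math
from collections import Counter

test_sentences_clean = [
    'bill travelled office car house play football',
    'bill reached office want play football picked football left office',
    'bill went friend house',
    'friend name fred',
    'reaching friend house gave football fred',
]
tokenized = [s.split() for s in test_sentences_clean]
n_docs = len(tokenized)

def idf(term):
    # Smoothed idf: log((1 + N) / (1 + df)) + 1
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

# Second sentence: count * idf for each term, then l2 normalization
counts = Counter(tokenized[1])
raw = {t: c * idf(t) for t, c in counts.items()}
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {t: w / norm for t, w in raw.items()}

print(counts["football"], counts["office"])  # 2 2
print(round(weights["football"], 4), round(weights["office"], 4))  # 0.4346 0.5236
```

These values agree with row 1 of the TF-IDF dataframes above.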