### Thinkful project 4.4.3
In this lesson you are presented with a series of sentences and asked to calculate the tf-idf score for each word in each sentence.

This notebook aims to demonstrate this programmatically, as well as shed light on how the gensim and sklearn libraries handle this task.

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import re
%matplotlib inline


from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from gensim.models import TfidfModel
from gensim.corpora import Dictionary




These are the sentences:
    
    "The best Monty Python sketch is the one about the dead parrot, I laughed so hard."
    "I laugh when I think about Python's Ministry of Silly Walks sketch, it is funny, funny, funny, the best!"
    "Chocolate is the best ice cream dessert topping, with a great taste."
    "The Lumberjack Song is the funniest Monty Python bit: I can't think of it without laughing."
    "I would rather put strawberries on my ice cream for dessert, they have the best taste."
    "The taste of caramel is a fantastic accompaniment to tasty mint ice cream."


According to the filter methods applied, the sentences can be shortened to these:

In [2]:
sentences=[
    " best Monty Python sketch laugh",
    "laugh Python sketch funny funny funny best",
    "best icecream dessert taste",
    "funny Monty Python laugh",
    "icecream dessert best taste",
    "taste taste icecream",
]

#### As in the lesson, we want to find the count of each word in each sentence.

In [3]:
#Instantiate CountVectorizer to build our frequency matrix
counter = CountVectorizer()
#Vectorize the sentences
tf_mat = counter.fit_transform(sentences)
#Create a dataframe with the output from the counter, and the words as the columns
#We transpose later to stay consistent with the lesson
#Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
counts = pd.DataFrame(tf_mat.todense(), columns=counter.get_feature_names())
counts

sentence,best,dessert,funny,icecream,laugh,monty,python,sketch,taste
0,1,0,0,0,1,1,1,1,0
1,1,0,3,0,1,0,1,1,0
2,1,1,0,1,0,0,0,0,1
3,0,0,1,0,1,1,1,0,0
4,1,1,0,1,0,0,0,0,1
5,0,0,0,1,0,0,0,0,2


Note that CountVectorizer returns the words as columns and the sentences as rows.
This is the transpose of the layout used in the lesson.

In [4]:
#Calculate the document frequency (df) and collection frequency (cf), then the idf

#Create a dataframe with the words as the index
tfidf_mat = pd.DataFrame(index=counter.get_feature_names())

#df: count the number of non-zero entries per column (sentences containing the word)
tfidf_mat['df'] = counts.astype(bool).sum(axis=0)

#cf: sum the occurrences of each word across all sentences
tfidf_mat['cf'] = counts.sum(axis=0)

#calculate idf = log2(N/df), where N is the number of sentences (6 here)
tfidf_mat['idf'] = round(np.log2(len(counts)/tfidf_mat.df),4)
tfidf_mat

word,df,cf,idf
best,4,4,0.585
dessert,2,2,1.585
funny,2,4,1.585
icecream,3,3,1.0
laugh,3,3,1.0
monty,2,2,1.585
python,3,3,1.0
sketch,2,2,1.585
taste,3,4,1.0
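
If you want to double-check the df, cf and idf columns without pandas, the sketch below recomputes them with the standard library only. It assumes the same six shortened sentences and the lesson's formula idf = log2(N/df); the names `docs`, `doc_freq` and `coll_freq` are made up for this illustration.

```python
import math
from collections import Counter

sentences = [
    " best Monty Python sketch laugh",
    "laugh Python sketch funny funny funny best",
    "best icecream dessert taste",
    "funny Monty Python laugh",
    "icecream dessert best taste",
    "taste taste icecream",
]

# Lowercase and split on whitespace, mirroring CountVectorizer's lowercasing
docs = [s.lower().split() for s in sentences]
N = len(docs)

# df: number of sentences that contain the word at least once
doc_freq = Counter(word for doc in docs for word in set(doc))
# cf: total number of occurrences across all sentences
coll_freq = Counter(word for doc in docs for word in doc)
# idf = log2(N / df)
idf = {word: math.log2(N / doc_freq[word]) for word in doc_freq}

for word in sorted(idf):
    print(f"{word:<9} df={doc_freq[word]} cf={coll_freq[word]} idf={idf[word]:.3f}")
```

Note how "funny" and "taste" have a cf larger than their df: repeats inside one sentence raise the collection frequency but not the document frequency.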


In [5]:
#Transpose and concat the counts df and the idf df
tfidf_mod = pd.concat([counts.T,tfidf_mat],axis=1)
tfidf_mod

word,0,1,2,3,4,5,df,cf,idf
best,1,1,1,0,1,0,4,4,0.585
dessert,0,0,1,0,1,0,2,2,1.585
funny,0,3,0,1,0,0,2,4,1.585
icecream,0,0,1,0,1,1,3,3,1.0
laugh,1,1,0,1,0,0,3,3,1.0
monty,1,0,0,1,0,0,2,2,1.585
python,1,1,0,1,0,0,3,3,1.0
sketch,1,1,0,0,0,0,2,2,1.585
taste,0,0,1,0,1,2,3,4,1.0


In [6]:
#Calculate the tfidf

for col in tfidf_mod.iloc[:, :6]:#select only the first 6 columns
    tfidf_mod['tfidf'+str(col)] = round(tfidf_mod[col] * tfidf_mod.idf,4) #multiply the sentences by the idf column
    
#check output
tfidf_mod

word,0,1,2,3,4,5,df,cf,idf,tfidf0,tfidf1,tfidf2,tfidf3,tfidf4,tfidf5
best,1,1,1,0,1,0,4,4,0.585,0.585,0.585,0.585,0.0,0.585,0.0
dessert,0,0,1,0,1,0,2,2,1.585,0.0,0.0,1.585,0.0,1.585,0.0
funny,0,3,0,1,0,0,2,4,1.585,0.0,4.755,0.0,1.585,0.0,0.0
icecream,0,0,1,0,1,1,3,3,1.0,0.0,0.0,1.0,0.0,1.0,1.0
laugh,1,1,0,1,0,0,3,3,1.0,1.0,1.0,0.0,1.0,0.0,0.0
monty,1,0,0,1,0,0,2,2,1.585,1.585,0.0,0.0,1.585,0.0,0.0
python,1,1,0,1,0,0,3,3,1.0,1.0,1.0,0.0,1.0,0.0,0.0
sketch,1,1,0,0,0,0,2,2,1.585,1.585,1.585,0.0,0.0,0.0,0.0
taste,0,0,1,0,1,2,3,4,1.0,0.0,0.0,1.0,0.0,1.0,2.0
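
To make one column of that table concrete, here is a small worked example for sentence 1 (the one with "funny" three times). It is only a sketch: the counts and document frequencies are copied from the tables above, and `counts_s1` and `tfidf_s1` are names invented here.

```python
import math

N = 6  # number of sentences

# Raw counts for sentence 1 and document frequencies, read from the tables above
counts_s1 = {"best": 1, "funny": 3, "laugh": 1, "python": 1, "sketch": 1}
doc_freq = {"best": 4, "funny": 2, "laugh": 3, "python": 3, "sketch": 2}

# tf-idf = raw count * log2(N / df)
tfidf_s1 = {
    word: round(tf * math.log2(N / doc_freq[word]), 3)
    for word, tf in counts_s1.items()
}
print(tfidf_s1)
```

The value for "funny" comes out near 4.755, three times its idf of 1.585, which matches the tfidf1 column.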


### This is all you need for the lesson, but read on

Now we want to check how our tf-idf vectors compare with sklearn's TfidfVectorizer.

In [7]:
#Instantiate the vanilla TFIDF vectorizer from SKlearn
vectorizer = TfidfVectorizer()

#transform our sentences
sk_tf = vectorizer.fit_transform(sentences)
#create a dataframe with the transpose of the output as data and the words as the index
sk_tf_T = pd.DataFrame(sk_tf.todense().T,index=vectorizer.get_feature_names())
#label the columns
sk_tf_T.columns = ['sk_'+str(i) for i in range(6)]

#check output
sk_tf_T

word,sk_0,sk_1,sk_2,sk_3,sk_4,sk_5
best,0.364066,0.209294,0.421295,0.0,0.421295,0.0
dessert,0.0,0.0,0.582322,0.0,0.582322,0.0
funny,0.0,0.867872,0.0,0.540298,0.0,0.0
icecream,0.0,0.0,0.491636,0.0,0.491636,0.447214
laugh,0.424852,0.244239,0.0,0.456156,0.0,0.0
monty,0.503219,0.0,0.0,0.540298,0.0,0.0
python,0.424852,0.244239,0.0,0.456156,0.0,0.0
sketch,0.503219,0.289291,0.0,0.0,0.0,0.0
taste,0.0,0.0,0.491636,0.0,0.491636,0.894427


As you can see, the results are not the same. Do you know why?
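
One way to answer that, as a sketch rather than a look at sklearn's source: with default settings, the scikit-learn documentation gives the idf as ln((1 + N) / (1 + df)) + 1 and then scales each sentence vector to unit Euclidean (L2) length. The snippet below applies those two steps by hand to sentence 0; `raw` and `normalized` are names made up for this example.

```python
import math

N = 6  # number of sentences

# Counts for sentence 0 and document frequencies, read from the tables above
counts_s0 = {"best": 1, "laugh": 1, "monty": 1, "python": 1, "sketch": 1}
doc_freq = {"best": 4, "laugh": 3, "monty": 2, "python": 3, "sketch": 2}

# Step 1: smoothed idf with natural log, ln((1 + N) / (1 + df)) + 1
raw = {
    word: tf * (math.log((1 + N) / (1 + doc_freq[word])) + 1)
    for word, tf in counts_s0.items()
}

# Step 2: L2 normalization, dividing by the Euclidean length of the vector
length = math.sqrt(sum(v * v for v in raw.values()))
normalized = {word: v / length for word, v in raw.items()}

for word, value in normalized.items():
    print(f"{word:<7}{value:.6f}")
```

The printed values should line up with the sk_0 column above (for example, roughly 0.364 for "best" and 0.503 for "monty").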

In [8]:
#Let's try again, this time passing some arguments to the tf-idf vectorizer

vectorizer1 = TfidfVectorizer(norm=None,smooth_idf=False)

#transform our sentences
sk_tf1 = vectorizer1.fit_transform(sentences)
#create a dataframe with the transpose of the output as data and the words as the index
sk_tf_T1 = pd.DataFrame(sk_tf1.todense().T,index=vectorizer1.get_feature_names())
#label the columns
sk_tf_T1.columns = ['sk_'+str(i) for i in range(6)]

#check output
sk_tf_T1

word,sk_0,sk_1,sk_2,sk_3,sk_4,sk_5
best,1.405465,1.405465,1.405465,0.0,1.405465,0.0
dessert,0.0,0.0,2.098612,0.0,2.098612,0.0
funny,0.0,6.295837,0.0,2.098612,0.0,0.0
icecream,0.0,0.0,1.693147,0.0,1.693147,1.693147
laugh,1.693147,1.693147,0.0,1.693147,0.0,0.0
monty,2.098612,0.0,0.0,2.098612,0.0,0.0
python,1.693147,1.693147,0.0,1.693147,0.0,0.0
sketch,2.098612,2.098612,0.0,0.0,0.0,0.0
taste,0.0,0.0,1.693147,0.0,1.693147,3.386294


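The output still does not match our basic calculations. Why is that?
Even with normalization and smoothing turned off, sklearn's idf formula differs from the one in the lesson. Gensim is a separate library with its own tf-idf model, so let's check that out.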
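
For the curious, the remaining gap can be worked out by hand. According to the scikit-learn documentation, with `smooth_idf=False` the idf is ln(N/df) + 1, so it uses the natural log and adds 1, while the lesson uses log2(N/df). The sketch below checks two values from the table above and converts one back to the lesson's number; the variable names are invented for this example.

```python
import math

N = 6
df_best, df_dessert = 4, 2

# scikit-learn with smooth_idf=False: natural log, plus 1
sk_best = math.log(N / df_best) + 1        # about 1.405465
sk_dessert = math.log(N / df_dessert) + 1  # about 2.098612

# The lesson's idf: base-2 log, no offset
lesson_best = math.log2(N / df_best)       # about 0.585

# Converting between the two: remove the +1, then change the log base
recovered_best = (sk_best - 1) / math.log(2)

print(sk_best, sk_dessert)
print(lesson_best, recovered_best)
```

So the two idf definitions rank the words the same way, but the numbers only agree after removing the offset and switching the base.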

In [9]:
#In order to use the gensim model, we need to do some preprocessing on our sentences.


#The workflow requires that our input be an iterable of iterables, i.e. a list of lists containing tokenized data.
#This means that each word should be an element in a list, and these lists of words (sentences)
#need to be collected in an outer list as well

#look at our sentences
sentences

[' best Monty Python sketch laugh',
 'laugh Python sketch funny funny funny best',
 'best icecream dessert taste',
 'funny Monty Python laugh',
 'icecream dessert best taste',
 'taste taste icecream']

In [10]:
#Correct the format
sents = [i.split() for i in sentences]
sents

[['best', 'Monty', 'Python', 'sketch', 'laugh'],
 ['laugh', 'Python', 'sketch', 'funny', 'funny', 'funny', 'best'],
 ['best', 'icecream', 'dessert', 'taste'],
 ['funny', 'Monty', 'Python', 'laugh'],
 ['icecream', 'dessert', 'best', 'taste'],
 ['taste', 'taste', 'icecream']]

In [11]:
#Next we need to build a dictionary with an index key and the word as the value
dct = Dictionary(sents)  # fit dictionary
#look at output
[i for i in dct.values()],[i for i in dct]

(['Monty',
  'Python',
  'best',
  'laugh',
  'sketch',
  'funny',
  'dessert',
  'icecream',
  'taste'],
 [0, 1, 2, 3, 4, 5, 6, 7, 8])

In [12]:
#Next we build a corpus of nested lists of tuples (this is the bag-of-words, or BoW, format)
#each list is a sentence
#each tuple holds the word's id as the first element and its number of occurrences as the second

corpus = [dct.doc2bow(line) for line in sents]  # convert corpus to BoW format
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
 [(1, 1), (2, 1), (3, 1), (4, 1), (5, 3)],
 [(2, 1), (6, 1), (7, 1), (8, 1)],
 [(0, 1), (1, 1), (3, 1), (5, 1)],
 [(2, 1), (6, 1), (7, 1), (8, 1)],
 [(7, 1), (8, 2)]]
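
If the bag-of-words format looks opaque, the following pure-Python stand-in may help. It is not gensim's implementation, just an illustration: it assumes new words receive the next free id in sorted order within each sentence, which happens to reproduce the ids shown above. `word_to_id` and `bow_corpus` are hypothetical names.

```python
from collections import Counter

tokenized = [
    ["best", "Monty", "Python", "sketch", "laugh"],
    ["laugh", "Python", "sketch", "funny", "funny", "funny", "best"],
    ["best", "icecream", "dessert", "taste"],
    ["funny", "Monty", "Python", "laugh"],
    ["icecream", "dessert", "best", "taste"],
    ["taste", "taste", "icecream"],
]

word_to_id = {}   # hypothetical stand-in for the fitted dictionary
bow_corpus = []   # one list of (word id, count) tuples per sentence

for doc in tokenized:
    word_counts = Counter(doc)
    # Assumption: unseen words get the next free id, in sorted order
    for word in sorted(word_counts):
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)
    # Represent the sentence as sorted (id, count) pairs
    bow_corpus.append(sorted((word_to_id[w], n) for w, n in word_counts.items()))

print(word_to_id)
print(bow_corpus)
```

Word order inside a sentence is lost in this representation; only which words occur, and how often, survives.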

In [13]:
sents

[['best', 'Monty', 'Python', 'sketch', 'laugh'],
 ['laugh', 'Python', 'sketch', 'funny', 'funny', 'funny', 'best'],
 ['best', 'icecream', 'dessert', 'taste'],
 ['funny', 'Monty', 'Python', 'laugh'],
 ['icecream', 'dessert', 'best', 'taste'],
 ['taste', 'taste', 'icecream']]

In [14]:
#Finally we can create the model
model = TfidfModel(corpus)

In [15]:
dict(dct)

{0: 'Monty',
 1: 'Python',
 2: 'best',
 3: 'laugh',
 4: 'sketch',
 5: 'funny',
 6: 'dessert',
 7: 'icecream',
 8: 'taste'}

In [16]:
#check out a sentence in its vector form
vector = model[corpus[2]]
vector

[(2, 0.26550046748653927),
 (6, 0.719376514513868),
 (7, 0.4538760470273287),
 (8, 0.4538760470273287)]
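
These numbers look unfamiliar, but they can be reached from the hand-calculated values. As a sketch, assuming gensim's default settings use a log2(N/df) idf and scale each vector to unit Euclidean (L2) length, dividing our tf-idf values for sentence 2 by the vector's length should give the same result. The names `hand` and `unit` are made up here.

```python
import math

N = 6  # number of sentences

# Hand-calculated tf-idf for sentence 2, keyed by gensim word id:
# 2 = best (df 4), 6 = dessert (df 2), 7 = icecream (df 3), 8 = taste (df 3)
hand = {
    2: 1 * math.log2(N / 4),
    6: 1 * math.log2(N / 2),
    7: 1 * math.log2(N / 3),
    8: 1 * math.log2(N / 3),
}

# Scale the vector to unit Euclidean length
length = math.sqrt(sum(v * v for v in hand.values()))
unit = [(word_id, value / length) for word_id, value in hand.items()]

print(unit)
```

The result should agree with the gensim vector above to within floating-point error.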

In [17]:
#Let's compare to sklearn and our basic calculation
tfidf_mod.iloc[:, 9:],sk_tf_T

(          tfidf0  tfidf1  tfidf2  tfidf3  tfidf4  tfidf5
 best       0.585   0.585   0.585   0.000   0.585     0.0
 dessert    0.000   0.000   1.585   0.000   1.585     0.0
 funny      0.000   4.755   0.000   1.585   0.000     0.0
 icecream   0.000   0.000   1.000   0.000   1.000     1.0
 laugh      1.000   1.000   0.000   1.000   0.000     0.0
 monty      1.585   0.000   0.000   1.585   0.000     0.0
 python     1.000   1.000   0.000   1.000   0.000     0.0
 sketch     1.585   1.585   0.000   0.000   0.000     0.0
 taste      0.000   0.000   1.000   0.000   1.000     2.0,
               sk_0      sk_1      sk_2      sk_3      sk_4      sk_5
 best      0.364066  0.209294  0.421295  0.000000  0.421295  0.000000
 dessert   0.000000  0.000000  0.582322  0.000000  0.582322  0.000000
 funny     0.000000  0.867872  0.000000  0.540298  0.000000  0.000000
 icecream  0.000000  0.000000  0.491636  0.000000  0.491636  0.447214
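 laugh     0.424852  0.244239  0.000000  0.456156  0.000000  0.000000
 monty     0.503219  0.000000  0.000000  0.540298  0.000000  0.000000
 python    0.424852  0.244239  0.000000  0.456156  0.000000  0.000000
 sketch    0.503219  0.289291  0.000000  0.000000  0.000000  0.000000
 taste     0.000000  0.000000  0.491636  0.000000  0.491636  0.894427)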

In [18]:
#We can see that all three models give different results.
#Which one is correct?
#Can we get the gensim model to give us something identical to our hand-calculated tf-idf?

In [19]:
#Disable the normalizer
model = TfidfModel(corpus, normalize=False)
vector = model[corpus[2]]
print(vector)

[(2, 0.5849625007211562), (6, 1.5849625007211563), (7, 1.0), (8, 1.0)]


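We finally have an output equivalent to what we calculated by hand.
The point of this was to demonstrate that most higher-level tf-idf tools normalize their output under the hood.
Note that sklearn and gensim are independent libraries: each one normalizes its vectors to unit length by default, and sklearn also uses a natural-log idf with an added 1, which is why only gensim with normalization disabled reproduces the lesson's numbers.
For more on document length normalization in gensim, see:
https://rare-technologies.com/pivoted-document-length-normalisation/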