<a href="https://colab.research.google.com/github/tawfiqam/MI564/blob/main/TFIDF_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Revisiting Bag of Words

Let's revisit bag of words (BoW) from [the Naive Bayes classifier example](https://github.com/tawfiqam/MI564/blob/main/Naive_Bayes_Intro.ipynb).


The text:

`John likes to watch movies. Mary likes movies too.`

The bag of words for this text is a dictionary: each key is a word, and each value is the number of occurrences of that word in the text.

BoW = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
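
A minimal sketch of how this dictionary could be built with Python's standard library. The regular expression used for tokenization is an assumption made for this example (it simply keeps runs of letters and drops punctuation); the cells below tokenize differently.

```python
import re
from collections import Counter

text = "John likes to watch movies. Mary likes movies too."

# Assumed tokenizer for this sketch: keep runs of letters, drop punctuation
tokens = re.findall(r"[A-Za-z]+", text)

# Count how many times each word occurs
bow = dict(Counter(tokens))
print(bow)
```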

In [13]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from gensim import corpora
from nltk.corpus import stopwords
from collections import defaultdict

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

stoplist = stopwords.words('english')

texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
print(texts)

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]


In [2]:
print("The dictionary has: " +str(len(dictionary)) + " tokens")

for k, v in dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')

The dictionary has: 12 tokens
computer                 0
human                    1
interface                2
response                 3
survey                   4
system                   5
time                     6
user                     7
eps                      8
trees                    9
graph                   10
minors                  11


In [3]:
# Each document in the corpus is a sparse bag-of-words vector:
# a list of (token_id, count) pairs
corpus

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

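For example, the fourth document, "System and human system engineering testing of EPS", becomes `[(1, 1), (5, 2), (8, 1)]`: `system` (id 5) appears twice, while `human` (id 1) and `eps` (id 8) appear once each.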
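
To read a vector in the corpus, the ids can be mapped back to words. The sketch below hardcodes the id-to-token mapping printed above rather than calling gensim, so `id2token` and `decode` are names made up for this illustration.

```python
# Id-to-token mapping copied from the dictionary printed above
id2token = {
    0: "computer", 1: "human", 2: "interface", 3: "response",
    4: "survey", 5: "system", 6: "time", 7: "user",
    8: "eps", 9: "trees", 10: "graph", 11: "minors",
}

def decode(bow):
    """Turn a list of (token_id, count) pairs into (token, count) pairs."""
    return [(id2token[token_id], count) for token_id, count in bow]

# Bag-of-words vector of the fourth document
print(decode([(1, 1), (5, 2), (8, 1)]))
```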


TFIDF is the Term Frequency-Inverse Document Frequency model. Much like the count vectorizer, it is also a bag-of-words model.

The difference is that TFIDF weights the words: a word that appears in few documents gets a higher weight than a word that appears in many. Words that appear frequently across documents say little about any single document, so they are less important. Words that occur more rarely, but not too rarely, are more important.

The model is first trained on a corpus, where it learns how many documents each word appears in. At transformation time, it takes a bag-of-words vector and returns another vector of the same dimensionality, in which the features that were rare at training time get larger values. In effect, it converts integer-valued vectors into real-valued vectors.
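
As a rough illustration, the weights can be reproduced by hand for the fourth document. The sketch assumes gensim's default `TfidfModel` settings (raw term frequency, `log2(N / df)` as the inverse document frequency, and L2 normalization of the resulting vector). The document frequencies are counted from the filtered `texts` printed earlier, and the variable names are made up for this example.

```python
import math

num_docs = 9  # documents in the corpus

# Number of documents that contain each word (counted from `texts`)
doc_freq = {"human": 2, "system": 3, "eps": 2}

# Integer-valued bag-of-words vector of the fourth document
bow = {"human": 1, "system": 2, "eps": 1}

# Raw weight: term frequency times log2(N / df), so rarer words get a larger factor
raw = {w: tf * math.log2(num_docs / doc_freq[w]) for w, tf in bow.items()}

# L2-normalize so that the vector has unit length
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

for word, weight in tfidf.items():
    print(f"{word:8} {weight:.4f}")
```

Under these assumptions, `system` comes out at about 0.7185, which agrees with the value the model reports for it in the ranking below.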

In [5]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary


# Fit the TF-IDF model on the bag-of-words corpus
model = TfidfModel(corpus)

topWords = {}

# Apply the model to every document in the corpus
corpus_tfidf = model[corpus]

# For each word id, keep the highest TF-IDF weight it reaches in any document
for doc in corpus_tfidf:
    for iWord, tf_idf in doc:
        if iWord not in topWords:
            topWords[iWord] = 0

        if tf_idf > topWords[iWord]:
            topWords[iWord] = tf_idf

# Rank the words by their highest weight and print at most the top 100
wordimportance = []
for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
    wordimportance.append((dictionary[item[0]],item[1]))
    print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
    if i == 100: break

 1: trees         1.0
 2: system        0.7184811607083769
 3: graph         0.7071067811865475
 4: minors        0.695546419520037
 5: response      0.6282580468670046
 6: survey        0.6282580468670046
 7: time          0.6282580468670046
 8: computer      0.5773502691896257
 9: human         0.5773502691896257
10: interface     0.5773502691896257
11: eps           0.5710059809418182
12: user          0.45889394536615247


In [15]:
from gensim import similarities
from nltk.tokenize import word_tokenize

fcorpus_tfidf = model[corpus]


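# Query: lowercase and tokenize the text, map it to a bag-of-words vector,
# then apply the TF-IDF model to that vector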
query_text = "I love icecream and gensim"
query_text = query_text.lower()
query_text = word_tokenize(query_text)
vec_bow = dictionary.doc2bow(query_text)
vec_tfidf = model[vec_bow]
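
None of the words in this query appear in the 12-token dictionary, and `doc2bow` ignores tokens it does not know (unless called with `allow_update=True`), so the query vector here is empty. The following sketch mimics that lookup in plain Python. `doc2bow_sketch` is a hypothetical helper written for this illustration, not gensim's implementation, and it tokenizes with `split()` instead of `word_tokenize`.

```python
from collections import Counter

# Token-to-id mapping copied from the dictionary printed earlier
token2id = {
    "computer": 0, "human": 1, "interface": 2, "response": 3,
    "survey": 4, "system": 5, "time": 6, "user": 7,
    "eps": 8, "trees": 9, "graph": 10, "minors": 11,
}

def doc2bow_sketch(tokens):
    """Count known tokens and silently drop the ones missing from token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# No word of the original query is in the vocabulary
empty_vec = doc2bow_sketch("i love icecream and gensim".split())
# A query made of vocabulary words yields (token_id, count) pairs
known_vec = doc2bow_sketch("human computer interface".split())
print(empty_vec)
print(known_vec)
```

A query built from words the dictionary knows, such as the second one, is what gives the model something to weight.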

In [19]:
fcorpus_tfidf

<gensim.interfaces.TransformedCorpus at 0x7f5651a4b650>