<a href="https://colab.research.google.com/github/tawfiqam/MI564/blob/main/TFIDF_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Revisiting Bag of Words

Let's revisit bag of words (BoW) from [the Naive Bayes classifier example](https://github.com/tawfiqam/MI564/blob/main/Naive_Bayes_Intro.ipynb).


The text:

`John likes to watch movies. Mary likes movies too.`

In the bag of words below, each key is a word, and each value is the number of occurrences of that word in the given text document.

BoW = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
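As a minimal sketch, the same dictionary can be built with `collections.Counter` from the standard library. The regular-expression tokenizer is an assumption made for illustration; it is not the tokenizer used in the rest of this notebook.

```python
import re
from collections import Counter

text = "John likes to watch movies. Mary likes movies too."

# Assumption: a token is any run of word characters, so punctuation is dropped.
tokens = re.findall(r"\w+", text)

# Counter maps each distinct token to the number of times it occurs.
bow = dict(Counter(tokens))
print(bow)
```

Note that word order is discarded: only the counts survive, which is what makes this a "bag" of words.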

In [16]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from gensim import corpora
from nltk.corpus import stopwords
from collections import defaultdict

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"
]

stoplist = stopwords.words('english')

texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
print(texts)

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]


In [18]:
print("The dictionary has: " +str(len(dictionary)) + " tokens")

for k, v in dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')

The dictionary has: 12 tokens
computer                 0
human                    1
interface                2
response                 3
survey                   4
system                   5
time                     6
user                     7
eps                      8
trees                    9
graph                   10
minors                  11
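To make the id assignment and the `doc2bow` step less opaque, here is a plain-Python sketch of the same idea. It is not gensim's implementation. The rule used to hand out ids (new tokens get the next free id, in sorted order within each document) is an assumption chosen because it reproduces the ids printed above.

```python
from collections import Counter

# The filtered token lists printed earlier in this notebook.
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

# Assumption: unseen tokens receive the next free id, in sorted order per document.
token2id = {}
for text in texts:
    for token in sorted(set(text)):
        token2id.setdefault(token, len(token2id))

def to_bow(text):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())

sketch_corpus = [to_bow(text) for text in texts]
print(sketch_corpus[3])  # the fourth document, where 'system' occurs twice
```

Each document becomes a sparse vector: only the ids of tokens that actually occur are stored, together with their counts.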


In [19]:
#Each document in the corpus is a sparse word vector: a list of (token id, count) pairs
corpus

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
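 [(4, 1), (10, 1), (11, 1)]]

For example, the fourth document, `"System and human system engineering testing of EPS"`, becomes `[(1, 1), (5, 2), (8, 1)]`: `human` (id 1) and `eps` (id 8) each appear once, and `system` (id 5) appears twice.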


TF-IDF is the Term Frequency-Inverse Document Frequency model. Much like the count vectorizer, it is also a bag-of-words model.

The difference is that the words are weighted: a word that appears in few documents gets a higher weight than a word that appears in many. Words appearing frequently across documents are less important. Those occurring more rarely, but not too rarely, are more important.

Once the model is trained, it transforms one vector representation into another. The output vector has the same dimensionality as the input, but the values of the features that were rare at training time are increased. In effect, it converts integer-valued vectors into real-valued vectors.
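The weighting can be worked through by hand. The sketch below uses only the standard library and the bag-of-words vectors printed above. The formula (raw count times log2(N / df), followed by scaling each vector to unit length) is an assumption about gensim's default settings rather than something stated in this notebook; it does reproduce the weight of about 0.7185 that gensim reports for `system`.

```python
import math

# The bag-of-words corpus printed above, as (token_id, count) pairs.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

n_docs = len(bow_corpus)

# Document frequency: the number of documents each token id appears in.
doc_freq = {}
for doc in bow_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(doc):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weights = [(token_id, count * math.log2(n_docs / doc_freq[token_id]))
               for token_id, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:  # every token occurs in every document
        return []
    return [(token_id, w / norm) for token_id, w in weights]

# Fourth document: human (id 1), system (id 5, counted twice), eps (id 8).
for token_id, weight in tfidf(bow_corpus[3]):
    print(token_id, round(weight, 4))
```

`system` occurs in three of the nine documents, so its inverse document frequency is lower than that of `human` or `eps`, but its count of 2 in this document still gives it the largest weight in the vector.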

In [20]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary


model = TfidfModel(corpus)

#for every word, keep the highest TF-IDF weight it receives in any document
topWords = {}

corpus_tfidf = model[corpus]

for doc in corpus_tfidf:
    for iWord, tf_idf in doc:
        topWords[iWord] = max(topWords.get(iWord, 0), tf_idf)

#rank the words by that maximum weight, highest first
wordimportance = []
for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
    wordimportance.append((dictionary[item[0]],item[1]))
    print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
    if i == 100: break

 1: trees         1.0
 2: system        0.7184811607083769
 3: graph         0.7071067811865475
 4: minors        0.695546419520037
 5: response      0.6282580468670046
 6: survey        0.6282580468670046
 7: time          0.6282580468670046
 8: computer      0.5773502691896257
 9: human         0.5773502691896257
10: interface     0.5773502691896257
11: eps           0.5710059809418182
12: user          0.45889394536615247


By creating a TF-IDF 

In [23]:
!pip install psaw

Collecting psaw
  Downloading https://files.pythonhosted.org/packages/01/fe/e2f43241ff7545588d07bb93dd353e4333ebc02c31d7e0dc36a8a9d93214/psaw-0.1.0-py3-none-any.whl
Installing collected packages: psaw
Successfully installed psaw-0.1.0


In [24]:
import pandas as pd
#we will need datetime in order to specify the timeline we need to collect the data
import datetime as dt

#now we import the wrapper in order to use the API
from psaw import PushshiftAPI

api = PushshiftAPI()

In [25]:
#this function will allow us to find the last day of each month
#for example, there are 31 days in January, but 28 this February
def last_day_of_month(any_day):
    # day 28 exists in every month, so adding 4 days always lands in the next month
    next_month = any_day.replace(day=28) + dt.timedelta(days=4)
    # subtracting that date's day number steps back to the last day of the current month
    return next_month - dt.timedelta(days=next_month.day)
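For comparison, the same result can be obtained from the standard library's `calendar.monthrange`, which returns the weekday of the first day and the number of days in the month. The function name `last_day_of_month_calendar` is made up for this sketch.

```python
import calendar
import datetime as dt

def last_day_of_month_calendar(any_day):
    """Return the last day of any_day's month, using calendar.monthrange."""
    # monthrange returns (weekday of the first day, number of days in the month)
    _, days_in_month = calendar.monthrange(any_day.year, any_day.month)
    return any_day.replace(day=days_in_month)

print(last_day_of_month_calendar(dt.date(2019, 2, 10)))  # 2019-02-28
print(last_day_of_month_calendar(dt.date(2020, 2, 10)))  # 2020-02-29 (leap year)
```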

In [27]:


import datetime
subredditlist = ['Ex_Foster']
for reddit in subredditlist:
    for y in range(2019,2021):
        # note: range(1,12) stops at November, so December is never collected
        for i in range(1,12):
            file_name= str(reddit)+"_"+str(y)+"_"+str(i)+".json"
            print("starting with the month "+str(i))
            print("for subreddit..."+str(reddit))
            print("setting start epoch...")
            start_epoch=int(dt.datetime(y, i, 1).timestamp())
            print("setting end epoch...")
            last_day = last_day_of_month(datetime.date(y, i, 1))
            print("the last day of the month is...")
            print(last_day.day)
            last_day = int(last_day.day)
            # this is midnight at the start of the last day, so comments posted later that day are not requested
            end_epoch = int(dt.datetime(y,i,last_day).timestamp())
            print("setting up the generator...")
            gen = api.search_comments(after=start_epoch, before=end_epoch,subreddit=reddit)
            print("setting up the dataframe...")
            # obj.d_ is the dictionary of fields for one comment
            df = pd.DataFrame([obj.d_ for obj in gen])
            print("The number of comments for year "+ str(y)+" and month "+str(i)+" is "+str(len(df.index)))
            # df is overwritten on each pass; every month is saved to its own JSON file
            df.to_json(file_name)
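Because the loop above sets `end_epoch` to midnight at the start of the last day, one way to cover a whole month is to use the first instant of the following month as the upper bound. The sketch below is a suggestion, not part of the original collection code. The helper name `month_epoch_bounds` is made up, and it fixes the timezone to UTC, whereas the loop above uses the machine's local time.

```python
import datetime as dt

def month_epoch_bounds(year, month):
    """Return (start, end) epoch seconds covering the whole month; end is exclusive."""
    start = dt.datetime(year, month, 1, tzinfo=dt.timezone.utc)
    # After December, roll over to January of the next year.
    if month == 12:
        end = dt.datetime(year + 1, 1, 1, tzinfo=dt.timezone.utc)
    else:
        end = dt.datetime(year, month + 1, 1, tzinfo=dt.timezone.utc)
    return int(start.timestamp()), int(end.timestamp())

start_epoch, end_epoch = month_epoch_bounds(2019, 2)
print((end_epoch - start_epoch) // 86400)  # 28 full days in February 2019
```

With this approach no last-day calculation is needed, and December is handled by the rollover branch.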



starting with the month 1
for subreddit...Ex_Foster
setting start epoch...
setting end epoch...
the last day of the month is...
31
setting up the generator...
setting up the dataframe...
The number of comments for year 2019 and month 1 is 0
starting with the month 2
for subreddit...Ex_Foster
setting start epoch...
setting end epoch...
the last day of the month is...
28
setting up the generator...
setting up the dataframe...
The number of comments for year 2019 and month 2 is 0
starting with the month 3
for subreddit...Ex_Foster
setting start epoch...
setting end epoch...
the last day of the month is...
31
setting up the generator...
setting up the dataframe...
The number of comments for year 2019 and month 3 is 69
starting with the month 4
for subreddit...Ex_Foster
setting start epoch...
setting end epoch...
the last day of the month is...
30
setting up the generator...
setting up the dataframe...
The number of comments for year 2019 and month 4 is 233
starting with the month 5
for sub



The number of comments for year 2019 and month 7 is 78
starting with the month 8
for subreddit...Ex_Foster
setting start epoch...
setting end epoch...
the last day of the month is...
31
setting up the generator...
setting up the dataframe...
The number of comments for year 2019 and month 8 is 80
starting with the month 9
for subreddit...Ex_Foster
setting start epoch...
setting end epoch...
the last day of the month is...
30
setting up the generator...
setting up the dataframe...
The number of comments for year 2019 and month 9 is 229
starting with the month 10
for subreddit...Ex_Foster
setting start epoch...
setting end epoch...
the last day of the month is...
31
setting up the generator...
setting up the dataframe...
The number of comments for year 2019 and month 10 is 351
starting with the month 11
for subreddit...Ex_Foster
setting start epoch...
setting end epoch...
the last day of the month is...
30
setting up the generator...
setting up the dataframe...
The number of comments for 

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

#use unigrams, bigrams, and trigrams, and drop the NLTK English stop words
vectorizer = TfidfVectorizer(ngram_range=(1,3),stop_words=stoplist)
X = vectorizer.fit_transform(df['body'])
features_by_gram = defaultdict(list)

In [31]:
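#group every feature with its IDF weight by n-gram length (1, 2, or 3 words)
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
for f, w in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    features_by_gram[len(f.split(' '))].append((f, w))

top_n = 20
for gram, features in features_by_gram.items():
    #sort ascending by IDF: the lowest values belong to the n-grams found in the most comments
    top_features = sorted(features, key=lambda x: x[1])[:top_n]
    top_features = [str(f[0]) for f in top_features]
    print('{}-gram top:'.format(gram), top_features)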

1-gram top: ['like', 'know', 'sorry', 'people', 'family', 'going', 'life', 'good', 'want', 'care', 'foster', 'really', 'say', 'better', 'even', 'love', 'things', 'feel', 'get', 'help']
2-gram top: ['foster care', 'sounds like', 'even though', 'put username', 'sorry loss', 'feel free', 'feels like', 'easier said', 'feel like', 'former foster', 'foster youth', 'go back', 'people know', 'said done', 'sorry going', 'wish best', '20 years', 'beautiful day', 'buy something', 'care even']
3-gram top: ['easier said done', 'ever need someone', 'feel free dm', 'one year care', '10 years like', '10x definitely blessing', '12 best answers', '13 years hurts', '14 imagining future', '16 especially home', '16 matter wonderful', '18 held resentment', '18 moved another', '20 years ago', '20 years every', '20s man felt', '25 still close', '30 hard resentful', '34 still get', '70 horses boys']
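The lists above are sorted by ascending IDF, so "top" here means the n-grams that occur in the largest number of comments. The sketch below shows why, using the smoothed IDF that scikit-learn documents for its default `smooth_idf=True`: `ln((1 + n) / (1 + df)) + 1`. The three short documents are invented for illustration and are not taken from the subreddit data, and the whitespace tokenizer is a simplification of what `TfidfVectorizer` does.

```python
import math
from collections import Counter

# Made-up toy documents, for illustration only.
docs = [
    "foster care was hard",
    "foster care helped me",
    "i feel like it was hard",
]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_frequency(docs, n):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc.split(), n)))
    return df

def smoothed_idf(df, n_docs):
    """Smoothed IDF: ln((1 + n_docs) / (1 + df)) + 1."""
    return {term: math.log((1 + n_docs) / (1 + count)) + 1
            for term, count in df.items()}

for n in (1, 2):
    idf = smoothed_idf(document_frequency(docs, n), len(docs))
    # Lowest IDF first: these are the n-grams shared by the most documents.
    lowest = sorted(idf.items(), key=lambda x: (x[1], x[0]))[:3]
    print(n, [term for term, _ in lowest])
```

An n-gram that appears in only one document gets the highest IDF, which is why the tail of a long ascending list, such as the 3-gram list above, fills up with one-off phrases in alphabetical order.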
