# Text Analysis of HITS
In this notebook I apply basic text mining techniques to learn more from our scrape of Amazon Mechanical Turk (MTURK).

I use the scikit-learn framework for part of this analysis.

In [86]:
# Library Imports
from getpass import getpass
import os
import json
import numpy as np
import pandas as pd
import spacy
from time import time
import re

# Vectorizers from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.decomposition import NMF, LatentDirichletAllocation


t0 = time()
! curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
! sudo apt-get install git-lfs
! git lfs install
# wipe all files
! rm -rf *
# Clone the dataset; uses getpass so only users with access to the repository can clone it
user = getpass('GitHub user')
password = getpass('GitHub password')
os.environ['GITHUB_AUTH'] = user + ':' + password
os.environ['USER'] = user
! rm -rf research/
! git clone https://$GITHUB_AUTH@github.com/$USER/research.git
# ! ls research/datasets/
# ! echo "untarring raw_mongodb_json_dump.tar.xz" && tar -xzvf research/datasets/raw_mongodb_json_dump.tar.gz
! cd research/datasets && ls
print('done importing libraries and dataset in %0.3fs.' % (time() - t0))

Detected operating system as Ubuntu/bionic.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... done.
Running apt-get update... done.

The repository is setup! You can now install packages.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2).
The following package was automatically installed and is no longer required:
  libnvidia-common-430
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 68 not upgraded.
Error: Failed to call git rev-parse --git-dir: exit status 128 
Git LFS initialized.
GitHub user··········
GitHub password··········
Cloning into 'research'...
remote: Enumerating objects: 81, done.
remote: Counting objects: 100% (81/81), done.
remote: Compressing o

## Creating a Bag of Words for Statistical Analysis

Bag of Words (BoW) is a simple way to turn text into numeric features for NLP models. Below I set up a spaCy lemmatizer, which maps words that share a meaning but differ in tense or case to a common base form, and then instantiate a `CountVectorizer` from `sklearn`, which counts the occurrences of each word in each document (a simple BoW). The result is displayed in a `pandas` dataframe. The `CountVectorizer` also removes English stop words (words that carry little meaning, e.g. the, an). The lemmatization calls are currently commented out, so the counts below are computed on the raw descriptions.
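
To make the count matrix concrete, here is a minimal pure-Python sketch of a bag of words on a toy corpus. The corpus, the tiny stop-word set, and the helper name `bag_of_words` are illustrative assumptions, not the HIT data or scikit-learn's implementation; `CountVectorizer` applies its own tokenization rules and a much larger stop-word list, and this sketch lowercases tokens for simplicity.

```python
from collections import Counter

# Hypothetical toy corpus standing in for HIT descriptions.
toy_docs = [
    "Transcribe the receipt",
    "Transcribe the audio clip",
    "Classify the image",
]
# Tiny illustrative stop-word set (scikit-learn's English list is far larger).
toy_stop_words = {"the", "a", "an"}


def bag_of_words(docs, stop_words):
    """Return (vocabulary, count_matrix) with one row per document."""
    tokenized = [
        [tok for tok in doc.lower().split() if tok not in stop_words]
        for doc in docs
    ]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(toy_docs, toy_stop_words)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, which is the same layout as the dataframe shown further down.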

In [87]:
# CountVectorizer Bag of Words

# load the spaCy English model
spacy.load('en')

# instantiate a blank English pipeline, used below to lemmatize tokens so that
# different forms of the same word are counted together
lemmatizer = spacy.lang.en.English()

hits = []
descriptions = []
titles = []
previews = []

t0 = time()
# parse hits.json (one JSON object per line); populate the hits, descriptions, and titles lists
with open("research/datasets/hits.json") as hits_file:
  for line in hits_file:
    hit = json.loads(line)
    hits.append(hit)
    descriptions.append(hit['description'])
    titles.append(hit['title'])
print('INFO: done parsing hits dataset, finished in %0.3fs.' % (time() - t0))

t0 = time()
# parse preview.json (one JSON object per line); strip HTML tags and populate the previews list
with open("research/datasets/preview.json") as preview_file:
  for line in preview_file:
    preview = json.loads(line)
    page_src = preview['page_src']
    clean_src = re.sub("<.*?>", "", page_src)
    previews.append(clean_src)

print('INFO: done parsing preview dataset, finished in %0.3fs.' % (time() - t0))
# print(hits[0]['title'])
# print(previews[0])
# create a dataframe from a word matrix
def word_matrix_to_data_frame(word_matrix, feat_names):
    # create an index for each row
    doc_names = ['Description{:d}'.format(idx) for idx, _ in enumerate(word_matrix)]
    df = pd.DataFrame(data=word_matrix.toarray(), index=doc_names,
                      columns=feat_names)
    return(df)


# run the spaCy pipeline on the text and return the list of lemmatized tokens
def lemma_tokenizer(data): 
  lemma_tokens = lemmatizer(data)
  return [tok.lemma_ for tok in lemma_tokens]

# t0 = time()
# descriptions = lemma_tokenizer(' '.join([str(s) for s in descriptions]))
# print('INFO: done lemmatizing descriptions, finished in %0.3fs.' % (time() - t0))

# t0 = time()
# titles = lemma_tokenizer(' '.join([str(s) for s in titles]))
# print('INFO: done lemmatizing titles, finished in %0.3fs.' % (time() - t0))

# t0 = time()
# previews = lemma_tokenizer(' '.join([str(s) for s in previews]))
# print('INFO: done lemmatizing previews, finished in %0.3fs.' % (time() - t0))

# instantiate the CountVectorizer, unigram
count_vec = CountVectorizer(lowercase=False, stop_words='english')

t0 = time()
# convert the documents into a word matrix
cv_word_matrix = count_vec.fit_transform(descriptions)
print('INFO: done converting descriptions into word matrix, finished in %0.3fs.' % (time() - t0))

t0 = time()
# extract the features from the description
cv_tokens = count_vec.get_feature_names()
print('INFO: done extracting features from dataset, finished in %0.3fs.' % (time() - t0))

t0 = time()

INFO: done parsing hits dataset, finished in 0.037s.
INFO: done parsing preview dataset, finished in 3.288s.
INFO: done converting descriptions into word matrix, finished in 0.015s.
INFO: done extracting features from dataset, finished in 0.001s.


In [88]:
# display the pandas dataframe for the CountVectorizer
word_matrix_to_data_frame(cv_word_matrix, cv_tokens)

Unnamed: 0,00,01,05,10,100,11,12,13,14,15,18,19,20,200,22,23,24,25,25mins,27,2755,28,30,31,32,33,34,35,37,39,40,41,42,420,43,44,45,46,47,4701,...,visible,visit,visual,voice,voluntary,volunteer,volunteers,want,war,watch,watching,waves,way,wealth,web,webpage,website,websites,week,weeks,weight,win,window,word,words,work,worker,workers,working,workout,workplace,world,worth,write,writing,written,years,youth,zillow,zin
Description0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Description1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Description2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Description3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Description4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Description1050,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Description1051,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Description1052,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Description1053,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [89]:
print('INFO: done printing dataframe for BoW CountVectorizer on dataset descriptions, finished in %0.3fs.' % (time() - t0))

INFO: done printing dataframe for BoW CountVectorizer on dataset descriptions, finished in 0.192s.


# Topic Modeling using NNMF and LDA
For the next part of this analysis I use Non-negative Matrix Factorization (NNMF) and Latent Dirichlet Allocation (LDA) to extract topics from the MTURK web scrape.

For the NNMF I use a TF-IDF BoW (`TfidfVectorizer`), which weights each word by how often it appears in an individual document, discounted by how many documents in the corpus as a whole contain it.
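
The weighting can be illustrated with a small pure-Python sketch. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row; the toy counts and the helper name `tfidf` are hypothetical and are not taken from the HIT dataset.

```python
import math

# Hypothetical count matrix: 3 documents x 3 terms.
# "transcribe" (column 0) appears in every document; the other terms are rarer.
terms = ["transcribe", "receipt", "audio"]
counts = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]


def tfidf(count_rows):
    """TF-IDF with smoothed idf and L2 row normalisation (a sketch)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in count_rows:
        raw = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        weighted.append([v / norm for v in raw])
    return idf, weighted


idf, weights = tfidf(counts)
for term, value in zip(terms, idf):
    print(term, round(value, 3))
```

In the first toy document the rarer word "receipt" ends up with a larger weight than "transcribe", even though both occur once, because "transcribe" appears in every document.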

**Global variables and common functions below:**

In [0]:
# Declare the number of features to extract from the dataset
NUM_FEATURES = 1000
# Declare the number of topics to fit to
NUM_COMPONENTS = 10
# Declare the number of top words to print for each topic
NUM_TOP_WORDS = 20

from html import unescape

# convert HTML entities (e.g. &amp;) in docs to plain characters and
# set everything to lowercase
def strip_html_preprocessor(doc):
    return unescape(doc).lower()

# returns TFIDF word matrix, used in NNMF, dataset: list
def get_tfidf_word_matrix_vectorizer(dataset): 
  # extract tfidf features for use in the NNMF
  tfidf_vectorizer = TfidfVectorizer(preprocessor=strip_html_preprocessor, max_df=0.95, min_df=2, max_features=NUM_FEATURES,
                                    stop_words='english', lowercase=False)
  t0 = time()
  tf_word_matrix = tfidf_vectorizer.fit_transform(dataset)
  print("INFO: done converting dataset to TFIDF word matrix, done in %0.3fs." % (time() - t0))
  return tf_word_matrix, tfidf_vectorizer

# returns the raw word count word vector for use in LDA
def get_cv_word_matrix_vectorizer(dataset):
  cv_vectorizer = CountVectorizer(preprocessor=strip_html_preprocessor, lowercase=False, stop_words='english')

  t0 = time()
  cv_word_matrix = cv_vectorizer.fit_transform(dataset)
  print("INFO: done converting dataset to CV (raw count) word matrix, done in %0.3fs." % (time() - t0))
  return cv_word_matrix, cv_vectorizer

# prints the top words for each topic of a fitted model
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
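
The slice in `print_top_words` is compact, so here is a small sketch of what `topic.argsort()[:-n_top_words - 1:-1]` selects. The vocabulary and topic weights are made-up values for illustration only.

```python
import numpy as np

# Hypothetical weights of one topic over a five-word vocabulary.
feature_names = ["audio", "image", "receipt", "survey", "video"]
topic = np.array([0.2, 0.9, 0.1, 0.5, 0.7])
n_top_words = 3

# argsort() orders indices from the smallest to the largest weight;
# the slice [:-n_top_words - 1:-1] walks backwards from the end,
# yielding the indices of the n_top_words largest weights, largest first.
top_indices = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(top_words)
```

The words therefore come out ordered from the strongest to the weakest association with the topic.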





# Text Analysis on Descriptions

The first output is for a Non-negative Matrix Factorization fitted with the Frobenius norm on TF-IDF features.

The second output is for a Non-negative Matrix Factorization fitted with the generalized Kullback-Leibler divergence on TF-IDF features. This is equivalent to Probabilistic Latent Semantic Indexing (PLSI).

The third output is a Latent Dirichlet Allocation fitted on raw word counts.
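
The two NNMF variants differ only in the reconstruction loss between the word matrix X and the product of the factors W and H. The sketch below writes out the standard definitions of both losses; the regularization terms controlled by `alpha` and `l1_ratio` are left out, and the small matrices and function names are hypothetical.

```python
import numpy as np


def frobenius_loss(X, W, H):
    """0.5 * ||X - WH||_F^2 (half the squared Frobenius norm)."""
    return 0.5 * np.sum((X - W @ H) ** 2)


def generalized_kl_loss(X, W, H, eps=1e-12):
    """sum(X * log(X / WH) - X + WH); eps guards against log(0)."""
    WH = W @ H
    return np.sum(X * np.log((X + eps) / (WH + eps)) - X + WH)


# Hypothetical factors: 3 documents, 4 terms, 2 topics.
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
H = np.array([[1.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 1.0]])
X_exact = W @ H          # matrix that the factors reconstruct perfectly
X_noisy = X_exact + 0.1  # same matrix with a small uniform offset

for name, X in [("exact", X_exact), ("noisy", X_noisy)]:
    print(name, frobenius_loss(X, W, H), generalized_kl_loss(X, W, H))
```

Both losses are zero when W and H reconstruct X exactly and grow as the reconstruction degrades; they penalize errors differently, which is why the two fits can produce different topics from the same TF-IDF matrix.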



In [91]:
# Extracts the TFIDF word matrix and Vectorizer
tf_word_matrix_descriptions, tf_vec_descriptions = get_tfidf_word_matrix_vectorizer(descriptions)

# fits NNMF with Frobenius norms on descriptions
t0 = time()
nnmf_fr = NMF(n_components=NUM_COMPONENTS, random_state=1,
              alpha=0.1, l1_ratio=.5).fit(tf_word_matrix_descriptions)
print('INFO: Finished fitting NNMF with Frobenius Norm and TFIDF word matrix, done in %0.3fs.' % (time() - t0))

# extracts top features from dataset
tfidf_top_features = tf_vec_descriptions.get_feature_names()

# prints topics for NNMF with Frobenius norms and TFIDF on HIT descriptions
print_top_words(nnmf_fr, tfidf_top_features, NUM_TOP_WORDS)

# fits NNMF(generalized with Kullback-Leibler divergence) with TFIDF
t0 = time()
nnmf_kld = NMF(n_components=NUM_COMPONENTS, random_state=1, 
               beta_loss='kullback-leibler',
               solver='mu', max_iter=1000, alpha=.1,
               l1_ratio=0.5).fit(tf_word_matrix_descriptions)
print('INFO: Finished fitting NNMF (generalized with Kullback-Leibler divergence) with TFIDF word matrix, done in %0.3fs.' % (time() - t0))

# prints topics for NNMF (generalized with Kullback-Leibler divergence) with TFIDF word matrix
print_top_words(nnmf_kld, tfidf_top_features, NUM_TOP_WORDS)

# fits LDA on raw count word matrix
t0 = time()
lda = LatentDirichletAllocation(n_components=NUM_COMPONENTS, max_iter=5,
                                learning_method='online',
                                learning_offset=50,
                                random_state=0)

cv_word_matrix_descriptions, cv_vec_descriptions = get_cv_word_matrix_vectorizer(descriptions)

# fit LDA on raw count word matrix
lda.fit(cv_word_matrix_descriptions)
print('INFO: Finished fitting LDA with raw word matrix, done in %0.3fs.' % (time() - t0))

cv_top_features = cv_vec_descriptions.get_feature_names()
print_top_words(lda, cv_top_features, NUM_TOP_WORDS)


INFO: done converting dataset to TFIDF word matrix, done in 0.018s.
INFO: Finished fitting NNMF with Frobenius Norm and TFIDF word matrix, done in 0.163s.
Topic #0: audit user contained transcribed receipt information image transcribe amounts grocery items given enter numbers upcs scanned data recipe expected experience
Topic #1: validate pair question provided answer questions read based filling filled file fields feedback fast zoster family failure extract extra experimental
Topic #2: video short transcribe data images amounts receipt grocery watch following search upcs numbers surgeon recording total complete minute card say
Topic #3: classify following image text images data transcribe adult content write replacement words shown displayed extra extract failure family fast feedback
Topic #4: audio installed said flash listen clip review seconds transcribe minute minutes 31 14 recording data second images address 22 11
Topic #5: shopping extract receipts summary items information tot

# Text Analysis on Title

The first output is for a Non-negative Matrix Factorization fitted with the Frobenius norm on TF-IDF features.

The second output is for a Non-negative Matrix Factorization fitted with the generalized Kullback-Leibler divergence on TF-IDF features. This is equivalent to Probabilistic Latent Semantic Indexing (PLSI).

The third output is a Latent Dirichlet Allocation fitted on raw word counts.

In [92]:
# Extracts the TFIDF word matrix and Vectorizer
tf_word_matrix_titles, tf_vec_titles = get_tfidf_word_matrix_vectorizer(titles)

# fits NNMF with Frobenius norms on titles
t0 = time()
nnmf_fr = NMF(n_components=NUM_COMPONENTS, random_state=1,
              alpha=0.1, l1_ratio=.5).fit(tf_word_matrix_titles)
print('INFO: Finished fitting NNMF with Frobenius Norm and TFIDF word matrix, done in %0.3fs.' % (time() - t0))

# extracts top features from dataset
tfidf_top_features = tf_vec_titles.get_feature_names()

# prints topics for NNMF with Frobenius norms and TFIDF on HIT titles
print_top_words(nnmf_fr, tfidf_top_features, NUM_TOP_WORDS)

# fits NNMF(generalized with Kullback-Leibler divergence) with TFIDF
t0 = time()
nnmf_kld = NMF(n_components=NUM_COMPONENTS, random_state=1, 
               beta_loss='kullback-leibler',
               solver='mu', max_iter=1000, alpha=.1,
               l1_ratio=0.5).fit(tf_word_matrix_titles)
print('INFO: Finished fitting NNMF (generalized with Kullback-Leibler divergence) with TFIDF word matrix, done in %0.3fs.' % (time() - t0))

# prints topics for NNMF (generalized with Kullback-Leibler divergence) with TFIDF word matrix
print_top_words(nnmf_kld, tfidf_top_features, NUM_TOP_WORDS)

# fits LDA on raw count word matrix
t0 = time()
lda = LatentDirichletAllocation(n_components=NUM_COMPONENTS, max_iter=5,
                                learning_method='online',
                                learning_offset=50,
                                random_state=0)

cv_word_matrix_titles, cv_vec_titles = get_cv_word_matrix_vectorizer(titles)

# fit LDA on raw count word matrix
lda.fit(cv_word_matrix_titles)
print('INFO: Finished fitting LDA with raw word matrix, done in %0.3fs.' % (time() - t0))

cv_top_features = cv_vec_titles.get_feature_names()
print_top_words(lda, cv_top_features, NUM_TOP_WORDS)

INFO: done converting dataset to TFIDF word matrix, done in 0.012s.
INFO: Finished fitting NNMF with Frobenius Norm and TFIDF word matrix, done in 0.135s.
Topic #0: audit receipt transcription items purchased invoice restaurant enter itemization bonus card emotions dollars determine development dirección english discretion document does
Topic #1: validation answer survey question song listen questions personality restaurant emotions quick hiphop rap editing discretion desktop details determine development electronics
Topic #2: video short transcribe shown clip skill surgeon rate search samples recording task recipe minute address store discretion editing details determine
Topic #3: audio transcription recording address email survey transcribe hieroglyph train_sec_prefilledcontent_11222019_rectified_v1 french partially filled card document dollars does discretion draw domain development
Topic #4: following image classify classification images enter shown contain description summary mark

# Text Analysis on Previews

**For the page source of the previews I strip the HTML tags with a regex to keep only the text. The CSS and JavaScript still need to be removed to improve the analysis.**
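
One possible way to drop the CSS and JavaScript is to parse the page instead of using a regex. The sketch below uses only the standard library's `html.parser`; the class name `VisibleTextExtractor`, the helper `visible_text`, and the sample page are hypothetical, and this is not the preprocessing that produced the outputs in this notebook.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(page_src):
    parser = VisibleTextExtractor()
    parser.feed(page_src)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Transcribe the receipt &amp; enter totals</p></body></html>"
)
print(visible_text(sample))
```

A function like `visible_text` could stand in for the `re.sub` call when building the previews list. Inline event handlers and text injected by scripts at runtime are not covered by this sketch.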

The first output is for a Non-negative Matrix Factorization fitted with the Frobenius norm on TF-IDF features.

The second output is for a Non-negative Matrix Factorization fitted with the generalized Kullback-Leibler divergence on TF-IDF features. This is equivalent to Probabilistic Latent Semantic Indexing (PLSI).

The third output is a Latent Dirichlet Allocation fitted on raw word counts.

In [93]:
# Extracts the TFIDF word matrix and Vectorizer
tf_word_matrix_previews, tf_vec_previews = get_tfidf_word_matrix_vectorizer(previews)

# fits NNMF with Frobenius norms on previews
t0 = time()
nnmf_fr = NMF(n_components=NUM_COMPONENTS, random_state=1,
              alpha=0.1, l1_ratio=.5).fit(tf_word_matrix_previews)
print('INFO: Finished fitting NNMF with Frobenius Norm and TFIDF word matrix, done in %0.3fs.' % (time() - t0))

# extracts top features from dataset
tfidf_top_features = tf_vec_previews.get_feature_names()

# prints topics for NNMF with Frobenius norms and TFIDF on HIT previews
print_top_words(nnmf_fr, tfidf_top_features, NUM_TOP_WORDS)

# fits NNMF(generalized with Kullback-Leibler divergence) with TFIDF
t0 = time()
nnmf_kld = NMF(n_components=NUM_COMPONENTS, random_state=1, 
               beta_loss='kullback-leibler',
               solver='mu', max_iter=1000, alpha=.1,
               l1_ratio=0.5).fit(tf_word_matrix_previews)
print('INFO: Finished fitting NNMF (generalized with Kullback-Leibler divergence) with TFIDF word matrix, done in %0.3fs.' % (time() - t0))

# prints topics for NNMF (generalized with Kullback-Leibler divergence) with TFIDF word matrix
print_top_words(nnmf_kld, tfidf_top_features, NUM_TOP_WORDS)

# fits LDA on raw count word matrix
t0 = time()
lda = LatentDirichletAllocation(n_components=NUM_COMPONENTS, max_iter=5,
                                learning_method='online',
                                learning_offset=50,
                                random_state=0)

cv_word_matrix_previews, cv_vec_previews = get_cv_word_matrix_vectorizer(previews)

# fit LDA on raw count word matrix
lda.fit(cv_word_matrix_previews)
print('INFO: Finished fitting LDA with raw word matrix, done in %0.3fs.' % (time() - t0))

cv_top_features = cv_vec_previews.get_feature_names()
print_top_words(lda, cv_top_features, NUM_TOP_WORDS)

INFO: done converting dataset to TFIDF word matrix, done in 5.561s.
INFO: Finished fitting NNMF with Frobenius Norm and TFIDF word matrix, done in 2.892s.
Topic #0: receipt business total gas fuel valid number store excluding slip bal phone doctored items zoomed spanish french fake receipts english
Topic #1: item function console true false input error brand let result class var price val abp return quantity isvalid hit info
Topic #2: ng p9r shreds cloak u0026awsaccesskeyid u0026expires cipher akiajz57f2vtbzpy6uma 3d signature amazonaws s3 production jpeg media mary id com blankunclearrotatezoom special
Topic #3: msg var conversation_id socket log message function div glass window null url button command css data id magnifier agent_id ui
Topic #4: paper layout flex font _font webkit elevation light blue material common apply grid var smoothing ms deep shadow host purple
Topic #5: nreum function countdown window var seconds exports return minutes init __nr_require performanceobserver pa