## Step 0: Latent Dirichlet Allocation ##

LDA is used to assign the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Therefore, choosing the right corpus of data is crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. 
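
The generative story in the bullets above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how gensim fits a model: the two topics and their word probabilities (`toy_topics`) are made up, and `sample_dirichlet` and `generate_document` are hypothetical helper names.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw this document's topic mixture (theta) from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics (made up for illustration, not learned from data).
toy_topics = [
    {"size": 0.5, "waist": 0.3, "shrink": 0.2},   # a "fit" topic
    {"fabric": 0.5, "seam": 0.3, "button": 0.2},  # a "quality" topic
]

rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it infers the topic mixtures and the per-topic word distributions that most plausibly produced them.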

## Step 1: Load the dataset

The dataset we'll use is a list of over 11 million Amazon product reviews from the Clothing, Shoes and Jewelry category. We'll start by loading it from the `Clothing_Shoes_and_Jewelry_5.json.gz` file.

In [58]:
import os
import json
import gzip
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from operator import itemgetter
%matplotlib inline

In [59]:
### load the data

data = []
with gzip.open('Clothing_Shoes_and_Jewelry_5.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
    
# total length of the list; this equals the total number of reviews
print(len(data))

# first row of the list
print(data[0])

11285464
{'overall': 5.0, 'vote': '2', 'verified': True, 'reviewTime': '05 4, 2014', 'reviewerID': 'A2IC3NZN488KWK', 'asin': '0871167042', 'style': {'Format:': ' Paperback'}, 'reviewerName': 'Ruby Tulip', 'reviewText': 'This book has beautiful photos, good and understandable directions, and many different kinds of jewelry.  Wire working and metalsmithing jewelry are covered.  Highly recommend this book.', 'summary': 'Unique designs', 'unixReviewTime': 1399161600}
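
Holding all 11,285,464 reviews in a Python list uses a lot of memory, and the rest of this notebook only looks at one product. A possible alternative, sketched below, is to filter while streaming the file. `iter_reviews` is a hypothetical helper, and the tiny sample file is made up so the sketch runs without the real dataset.

```python
import gzip
import json
import os
import tempfile

def iter_reviews(path, asin=None):
    """Yield one review dict per line, optionally keeping a single product."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            review = json.loads(line)
            if asin is None or review.get("asin") == asin:
                yield review

# Tiny made-up file so the sketch runs without the real dataset.
sample = [
    {"asin": "B000YXC2LI", "overall": 1.0, "reviewText": "Belt loop broke."},
    {"asin": "0871167042", "overall": 5.0, "reviewText": "Nice patterns"},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    for row in sample:
        f.write(json.dumps(row) + "\n")

levis = list(iter_reviews(tmp_path, asin="B000YXC2LI"))
print(len(levis))
```

With the real file, something like `pd.DataFrame(list(iter_reviews('Clothing_Shoes_and_Jewelry_5.json.gz', asin='B000YXC2LI')))` would build the per-product DataFrame directly.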


In [246]:
data = pd.DataFrame.from_dict(data)

In [252]:
data[data['asin']=='B000YXC2LI'].sort_values(by=['vote'], ascending=False)


Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,image
809393,1.0,92,True,"10 2, 2015",A3J4RZ4Z1ZUJPA,B000YXC2LI,"{'Size:': ' 33W x 30L', 'Color:': ' Light Ston...",Mark Clarkson,These Levis 501 jeans have to be fake. They h...,These Levi 501 Jeans must be fake. Don't buy.,1443744000,[https://images-na.ssl-images-amazon.com/image...
1127386,1.0,92,True,"10 2, 2015",A3J4RZ4Z1ZUJPA,B000YXC2LI,"{'Size:': ' 33W x 30L', 'Color:': ' Light Ston...",Mark Clarkson,These Levis 501 jeans have to be fake. They h...,These Levi 501 Jeans must be fake. Don't buy.,1443744000,[https://images-na.ssl-images-amazon.com/image...
1132146,1.0,9,False,"12 1, 2010",A1RQTO7VZVPC5R,B000YXC2LI,,OPD75,These pants are the worst Levis I have ever ow...,Worst ever,1291161600,
1132145,1.0,9,False,"12 7, 2010",A1FI5YI2ROQ4L4,B000YXC2LI,"{'Size:': ' 32W x 30L', 'Color:': ' Black'}",DBH,Item missing a belt loop. Inferior quality to...,Poor qualityI,1291680000,
808399,3.0,9,True,"02 27, 2016",AGS03I2DBC913,B000YXC2LI,"{'Size:': ' 33W x 32L', 'Color:': ' Dark Stone...",Joel,Levi's material quality seems to have decrease...,Not What They Used To Be,1456531200,
...,...,...,...,...,...,...,...,...,...,...,...,...
9641903,2.0,,True,"05 23, 2018",ARVV408QIY4RQ,B000YXC2LI,"{'Size:': ' 38W x 32L', 'Color:': ' Light Ston...",David Monk,"If I had known these were button fly, I would ...",Two Stars,1527033600,
9641904,2.0,,True,"05 23, 2018",A7X0N4M4D6KF0,B000YXC2LI,"{'Size:': ' 34W x 30L', 'Color:': ' Medium Sto...",Dave,Fit much tighter than expected - boot cut as w...,Not a great fit.,1527033600,
9641905,5.0,,True,"05 23, 2018",A3TDQ2NG6BMRTU,B000YXC2LI,"{'Size:': ' 30W x 28L', 'Color:': ' Rinse'}",jz2011,These are high quality jeans and actually fit....,Fits short guys and high quality. Zipper is cool.,1527033600,
9641906,5.0,,True,"05 22, 2018",AIKW2I970KEF6,B000YXC2LI,"{'Size:': ' 30W x 32L', 'Color:': ' Black'}",Laura Kibben,I was absolutely shocked when I opened these t...,I was absolutely shocked when I opened these t...,1526947200,


In [247]:
'''
Keep only the columns we need and save them to 'data_text'
'''
# We only need the review text plus a few metadata columns.
# .copy() gives an independent DataFrame, so adding columns below
# does not trigger pandas' SettingWithCopyWarning.
data_text = data[['reviewText']].copy()

data_text['index'] = data_text.index
data_text['vote'] = data['vote']
data_text['asin'] = data['asin']
data_text['overall'] = data['overall']
documents = data_text


In [248]:
documents.head()

Unnamed: 0,reviewText,index,vote,asin,overall
0,"This book has beautiful photos, good and under...",0,2.0,871167042,5.0
1,Loved their approach in this book and that it ...,1,,871167042,5.0
2,great,2,,871167042,5.0
3,"Always love the way Eva thinks, and there are ...",3,,871167042,5.0
4,Nice patterns,4,,871167042,5.0


Let's glance at the dataset:

In [249]:
'''
Get the total number of documents
'''
print(len(documents))

11285464


In [65]:
documents[:5]

Unnamed: 0,reviewText,index,vote
0,"This book has beautiful photos, good and under...",0,2.0
1,Loved their approach in this book and that it ...,1,
2,great,2,
3,"Always love the way Eva thinks, and there are ...",3,
4,Nice patterns,4,


In [66]:
documents.iloc[[1, 2]]

Unnamed: 0,reviewText,index,vote
1,Loved their approach in this book and that it ...,1,
2,great,2,


In [67]:
data_text.iloc[documents[documents['reviewText'].isna()].index.values]

Unnamed: 0,reviewText,index,vote
482,,482,9
581,,581,3
1479,,1479,
2025,,2025,
2036,,2036,
...,...,...,...
11282031,,11282031,
11282996,,11282996,
11283061,,11283061,
11283342,,11283342,3


In [190]:
documents.head()

Unnamed: 0,index,reviewText
0,1,"If I had wanted blue burlap sack, I would have..."
1,2,"Although advertised as pre-shrunk, these conti..."
2,7,These pants seem to be a bad copy of Levi's\nB...
3,12,Belt loop broke off after wearing once!!!!
4,17,Don't like the way this material looks ! I fee...


In [250]:
#Try out 1 star review for one product
documents = documents[documents['asin']=='B000YXC2LI'].copy()

In [251]:
documents.sort_values(by=['vote'], ascending=False)

Unnamed: 0,reviewText,index,vote,asin,overall
809393,These Levis 501 jeans have to be fake. They h...,809393,92,B000YXC2LI,1.0
1127386,These Levis 501 jeans have to be fake. They h...,1127386,92,B000YXC2LI,1.0
1132146,These pants are the worst Levis I have ever ow...,1132146,9,B000YXC2LI,1.0
1132145,Item missing a belt loop. Inferior quality to...,1132145,9,B000YXC2LI,1.0
808399,Levi's material quality seems to have decrease...,808399,9,B000YXC2LI,3.0
...,...,...,...,...,...
9641903,"If I had known these were button fly, I would ...",9641903,,B000YXC2LI,2.0
9641904,Fit much tighter than expected - boot cut as w...,9641904,,B000YXC2LI,2.0
9641905,These are high quality jeans and actually fit....,9641905,,B000YXC2LI,5.0
9641906,I was absolutely shocked when I opened these t...,9641906,,B000YXC2LI,5.0


In [219]:
documents.head()

Unnamed: 0,reviewText,index,vote,asin,overall
802956,These jeans fit verywell and are tight In all ...,802956,,B000YXC2LI,5.0
802957,"If I had wanted blue burlap sack, I would have...",802957,100.0,B000YXC2LI,1.0
802958,"Although advertised as pre-shrunk, these conti...",802958,12.0,B000YXC2LI,1.0
802959,"I normally don't do reviews, but this time I h...",802959,2.0,B000YXC2LI,5.0
802960,I've been frustrated for the last several year...,802960,5.0,B000YXC2LI,5.0


In [220]:
documents.count()

reviewText    19657
index         19693
vote            508
asin          19693
overall       19693
dtype: int64

In [231]:
documents=documents.reset_index(drop=True)

In [232]:
documents.head()

Unnamed: 0,reviewText,index,vote,asin,overall
0,These jeans fit verywell and are tight In all ...,802956,,B000YXC2LI,5.0
1,"If I had wanted blue burlap sack, I would have...",802957,100.0,B000YXC2LI,1.0
2,"Although advertised as pre-shrunk, these conti...",802958,12.0,B000YXC2LI,1.0
3,"I normally don't do reviews, but this time I h...",802959,2.0,B000YXC2LI,5.0
4,I've been frustrated for the last several year...,802960,5.0,B000YXC2LI,5.0


In [233]:
documents = documents[documents['overall']==1][['reviewText', 'vote', 'asin', 'overall']].reset_index(drop=True)

In [234]:
documents['vote']

0       100
1        12
2         3
3       NaN
4       NaN
       ... 
1675      5
1676     14
1677      8
1678     38
1679      6
Name: vote, Length: 1680, dtype: object

In [242]:
documents.sort_values(by=['vote'], ascending=False)

Unnamed: 0,reviewText,vote,asin,overall
1335,These Levis 501 jeans have to be fake. They h...,92,B000YXC2LI,1.0
495,These Levis 501 jeans have to be fake. They h...,92,B000YXC2LI,1.0
465,"It teared apart in two weeks, it was only used...",9,B000YXC2LI,1.0
823,These pants are the worst Levis I have ever ow...,9,B000YXC2LI,1.0
1663,These pants are the worst Levis I have ever ow...,9,B000YXC2LI,1.0
...,...,...,...,...
1664,"Ever heard the term ""selling from an empty wag...",,B000YXC2LI,1.0
1665,"The jean smells very bad, imagine that i wear ...",,B000YXC2LI,1.0
1667,Don't be as stupid as the Levi Corporation. A...,,B000YXC2LI,1.0
1668,"The material is too hard, shrank after washing...",,B000YXC2LI,1.0
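
Note that the `vote` column has dtype `object`: the counts are stored as strings, so the sort above is lexicographic. That is why the reviews with 92 votes rank above the review with 100 votes. A minimal sketch of a numeric conversion follows. `parse_vote` is a hypothetical helper; treating missing votes as 0 and stripping thousands separators are assumptions about the data, not something the dataset documents.

```python
import math

def parse_vote(value):
    """Convert a raw vote such as '92' or '1,204' to int; missing values become 0."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    return int(float(str(value).replace(",", "")))

raw_votes = ["92", "9", "100", float("nan"), "12"]

# String sort: compares character by character, so '92' outranks '100'.
string_sorted = sorted((v for v in raw_votes if isinstance(v, str)), reverse=True)
print(string_sorted)

# Numeric sort after parsing.
numeric_sorted = sorted((parse_vote(v) for v in raw_votes), reverse=True)
print(numeric_sorted)
```

In the notebook this could be applied with `documents['vote'] = documents['vote'].map(parse_vote)` before calling `sort_values`.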


In [235]:
for i in range(10):
    print(i+1, '.', documents['reviewText'][i])


1 . If I had wanted blue burlap sack, I would have looked for it. Levi now thinks they can flog off any material on their customer. The last time I ordered these levis I got something in stripes that looked like I was escaped from prison. Now it is burlap! I will never order anything from Levi again until they come back to the original levi material. What complete junk. This product sucks.
2 . Although advertised as pre-shrunk, these continue to shrink drastically, and the wrinkles are hideous.  Got two different colors, same results.  Ordered the 517's as well, and they stayed true to fit without the wrinkles.
3 . These pants seem to be a bad copy of Levi's
Bad seams
Estos pantalones parecen ser una mala copia de Levi's
Mal acabados en general, malas costuras
4 . Belt loop broke off after wearing once!!!!
5 . Don't like the way this material looks ! I feel as though it's to hard to buy these o. Line because when they get here there not like the ones I get a sears and other places!
6 .

## Step 2: Data Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.


In [73]:
# !pip install wheel

In [74]:
# !pip install gensim

In [75]:
# !pip install nltk

In [76]:
'''
Loading Gensim and nltk libraries
'''
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [77]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/sxu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Lemmatizer Example
Before preprocessing our dataset, let's first look at a lemmatization example. What would be the output if we lemmatized the word 'went'?

In [78]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to present tense

go


### Stemmer Example
Let's also look at a stemming example. Let's throw a number of words at the stemmer and see how it deals with each one:

In [79]:
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

Unnamed: 0,original word,stemmed
0,caresses,caress
1,flies,fli
2,dies,die
3,mules,mule
4,denied,deni
5,died,die
6,agreed,agre
7,owned,own
8,humbled,humbl
9,sized,size


In [80]:
'''
Write a function to perform the pre processing steps on the entire dataset
'''
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # TODO: Apply lemmatize_stemming() on the token, then add to the results list
            result.append(lemmatize_stemming(token))
    return result



In [81]:
'''
Preview a document after preprocessing
'''
document_num = 12

doc_sample = documents[documents['index'] == document_num].values[0][1]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['Belt', 'loop', 'broke', 'off', 'after', 'wearing', 'once!!!!']


Tokenized and lemmatized document: 
['belt', 'loop', 'break', 'wear']


In [82]:
documents = documents[documents['reviewText'].notna()]

Let's now preprocess all the reviews we have. To do that, let's use the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas to apply `preprocess()` to the `reviewText` column.

**Note**: This may take a few minutes (it takes 6 minutes on my laptop)

In [83]:
# TODO: preprocess all the reviews, saving the list of results as 'processed_docs'
processed_docs = documents['reviewText'].map(preprocess)

In [84]:
'''
Preview 'processed_docs'
'''
processed_docs[:10][0]

['want',
 'blue',
 'burlap',
 'sack',
 'look',
 'levi',
 'think',
 'flog',
 'materi',
 'custom',
 'time',
 'order',
 'levi',
 'strip',
 'look',
 'like',
 'escap',
 'prison',
 'burlap',
 'order',
 'levi',
 'come',
 'origin',
 'levi',
 'materi',
 'complet',
 'junk',
 'product',
 'suck']

## Step 3.1: Bag of words on the dataset

Now let's create a dictionary from 'processed_docs' containing the number of times a word appears in the training set. To do that, let's pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call it '`dictionary`'.

In [85]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [86]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 blue
1 burlap
2 come
3 complet
4 custom
5 escap
6 flog
7 junk
8 levi
9 like
10 look


In [87]:
len(dictionary)

2073

**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* fewer than no_below documents (absolute number) or
* more than no_above documents (fraction of total corpus size, not absolute number).
* after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if None).
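
To make the rule concrete, here is a pure-Python approximation of the filtering described above. It is a sketch of the logic only, not gensim's implementation; the function name mirrors the gensim method for readability, and the four toy documents are made up.

```python
from collections import Counter

def filter_extremes(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: in how many documents each token occurs at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept if keep_n is None else kept[:keep_n])

toy_docs = [
    ["levi", "jean", "shrink"],
    ["levi", "button"],
    ["levi", "jean", "seam"],
    ["levi", "shrink"],
]
surviving = filter_extremes(toy_docs, no_below=2, no_above=0.75)
print(surviving)
```

Here "levi" is dropped for being too common (4 of 4 documents), while "button" and "seam" are dropped for being too rare (1 document each).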

In [88]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
# TODO: apply dictionary.filter_extremes() with the parameters mentioned above
# dictionary.filter_extremes(no_below=15, no_above=0.1)

'\nOPTIONAL STEP\nRemove very rare and very common words:\n\n- words appearing in fewer than 15 documents\n- words appearing in more than 10% of all documents\n'

In [89]:
# len(dictionary)

**Gensim doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
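
As a rough picture of what this conversion does, the sketch below builds a token-to-id mapping and a bag-of-words vector in plain Python. `build_vocab` and the standalone `doc2bow` function are hypothetical stand-ins for illustration, not gensim's code; ids are assigned document by document in sorted token order, which appears to match the ids seen in this notebook's output.

```python
from collections import Counter

def build_vocab(docs):
    """Assign integer ids to tokens, document by document, in sorted order."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Return a sorted list of (token_id, token_count) pairs for known tokens."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

toy_docs = [["belt", "loop", "break", "wear"], ["wear", "shrink", "wear"]]
token2id = build_vocab(toy_docs)
bow = doc2bow(toy_docs[1], token2id)
print(token2id)
print(bow)
```

Word order is discarded: only which tokens occur, and how often, survives into the bag-of-words representation.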

In [90]:
'''
Create the bag-of-words model: for each document, a list of (token_id, token_count)
pairs recording which words appear and how many times. Save this to 'bow_corpus'
'''
# The dictionary was already built from processed_docs, so it must not be updated
# again here (allow_update=True would count every document twice in its statistics).
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[:5]

[[(0, 1),
  (1, 2),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 4),
  (9, 1),
  (10, 2),
  (11, 2),
  (12, 2),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1)],
 [(12, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 2),
  (30, 1),
  (31, 1),
  (32, 2)],
 [(8, 2),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 2),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1)],
 [(44, 1), (45, 1), (46, 1), (47, 1)],
 [(9, 2),
  (10, 1),
  (11, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1)]]

In [91]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
bow_corpus[document_num]

[(8, 1), (25, 1), (31, 1), (40, 1), (122, 1)]

In [92]:
'''
Preview BOW for our sample preprocessed document
'''
# Here document_num is the sample document (index 12) which we checked in Step 2
bow_doc_4310 = bow_corpus[documents[documents['index']==document_num].index.values[0]]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                     dictionary[bow_doc_4310[i][0]], 
                                                     bow_doc_4310[i][1]))

Word 44 ("belt") appears 1 time.
Word 45 ("break") appears 1 time.
Word 46 ("loop") appears 1 time.
Word 47 ("wear") appears 1 time.


In [93]:
'''
Preview BOW for our sample preprocessed document
'''
# Here we look at the first document in the corpus (position 0)
bow_doc_0 = bow_corpus[0]

for i in range(len(bow_doc_0)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_0[i][0], 
                                                     dictionary[bow_doc_0[i][0]], 
                                                     bow_doc_0[i][1]))

Word 0 ("blue") appears 1 time.
Word 1 ("burlap") appears 2 time.
Word 2 ("come") appears 1 time.
Word 3 ("complet") appears 1 time.
Word 4 ("custom") appears 1 time.
Word 5 ("escap") appears 1 time.
Word 6 ("flog") appears 1 time.
Word 7 ("junk") appears 1 time.
Word 8 ("levi") appears 4 time.
Word 9 ("like") appears 1 time.
Word 10 ("look") appears 2 time.
Word 11 ("materi") appears 2 time.
Word 12 ("order") appears 2 time.
Word 13 ("origin") appears 1 time.
Word 14 ("prison") appears 1 time.
Word 15 ("product") appears 1 time.
Word 16 ("sack") appears 1 time.
Word 17 ("strip") appears 1 time.
Word 18 ("suck") appears 1 time.
Word 19 ("think") appears 1 time.
Word 20 ("time") appears 1 time.
Word 21 ("want") appears 1 time.


## Step 3.2: TF-IDF on our document set ##

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality.

*Please note: The author of Gensim recommends the Bag of Words model as the standard input for LDA.*

**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log_e(Total number of documents / Number of documents with term w in it)`.

**For example**

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times. 
* The term frequency (i.e., tf) for 'tiger' is then: 
    - `TF = (3 / 100) = 0.03`. 

* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Then, the inverse document frequency (i.e., idf) is calculated as:
    - `IDF = log_e(10,000,000 / 1,000) = log_e(10,000) ≈ 9.21`.

* Thus, the TF-IDF weight is the product of these quantities:
    - `TF-IDF = 0.03 * 9.21 ≈ 0.28`.
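
The 'tiger' arithmetic can be checked with a few lines of Python. The value of IDF depends on the base of the logarithm: for these counts the natural log gives roughly 9.21, while a base-10 log gives 4. The helper names `tf` and `idf` are illustrative only. As far as I know, gensim's `TfidfModel` uses its own weighting (base-2 logarithm and vector normalization by default), so its scores will not match this hand calculation.

```python
import math

def tf(term_count, doc_length):
    """Term frequency: share of the document taken up by the term."""
    return term_count / doc_length

def idf(n_docs, docs_with_term, base=math.e):
    """Inverse document frequency with a configurable logarithm base."""
    return math.log(n_docs / docs_with_term, base)

tf_tiger = tf(3, 100)
idf_natural = idf(10_000_000, 1_000)          # natural log
idf_base10 = idf(10_000_000, 1_000, base=10)  # base-10 log

print(tf_tiger * idf_natural)
print(tf_tiger * idf_base10)
```

Whichever base is used, the ranking of terms within a corpus stays the same, because changing the base only rescales every IDF value by a constant factor.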

In [94]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
from gensim import corpora, models

# TODO

tfidf = models.TfidfModel(bow_corpus)  # fit model
# vector = model[corpus[0]] 

In [95]:
tfidf

<gensim.models.tfidfmodel.TfidfModel at 0x1252a2050>

In [96]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
# TODO
corpus_tfidf = tfidf[bow_corpus]

In [97]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.1516696682342415),
 (1, 0.5536927100634086),
 (2, 0.11208334169158266),
 (3, 0.1628503253747485),
 (4, 0.16035772121348466),
 (5, 0.2768463550317043),
 (6, 0.2768463550317043),
 (7, 0.14450106304865498),
 (8, 0.14730347526937884),
 (9, 0.06472985315974981),
 (10, 0.18186262895757666),
 (11, 0.1686775320503952),
 (12, 0.15933124066902066),
 (13, 0.11518786315414174),
 (14, 0.2768463550317043),
 (15, 0.09275896544039047),
 (16, 0.2768463550317043),
 (17, 0.2768463550317043),
 (18, 0.19134933278898741),
 (19, 0.10396883421741349),
 (20, 0.08284349663662864),
 (21, 0.11208334169158266)]


## Step 4.1: Running LDA using Bag of Words ##

We are going for 2 topics in the document corpus.

**We will be running LDA using all CPU cores to parallelize and speed up model training.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of extra processes to use for parallelization. By default, it uses all available cores minus one.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (the default is `1/num_topics`).
    - Alpha controls the per-document topic distribution.
        * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document has a mixture of very few topics.

    - Eta controls the per-topic word distribution.
        * High eta: Each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic has a mixture of few words.

* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, and passes is 2, then online training is done in 10 updates:
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999` 
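
The update schedule listed above can be reproduced with a small helper. This is a simplified picture that ignores how `LdaMulticore` distributes chunks across workers; `update_schedule` is a hypothetical name used only for this sketch.

```python
def update_schedule(n_docs, chunksize, passes):
    """List the (pass, first_doc, last_doc) chunks seen during online training."""
    schedule = []
    for p in range(passes):
        for start in range(0, n_docs, chunksize):
            end = min(start + chunksize, n_docs) - 1
            schedule.append((p + 1, start, end))
    return schedule

schedule = update_schedule(50_000, 10_000, 2)
for i, (p, start, end) in enumerate(schedule, start=1):
    print(f"#{i} pass {p}: documents {start:,}-{end:,}")
```

For a small corpus like the 1,680 one-star reviews used here, the whole corpus fits in a single chunk if the chunk size is larger than the corpus (gensim's default is 2,000, to the best of my knowledge), so each pass amounts to one chunk.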

In [98]:
bow_corpus

[[(0, 1),
  (1, 2),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 4),
  (9, 1),
  (10, 2),
  (11, 2),
  (12, 2),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1)],
 [(12, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 2),
  (30, 1),
  (31, 1),
  (32, 2)],
 [(8, 2),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 2),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1)],
 [(44, 1), (45, 1), (46, 1), (47, 1)],
 [(9, 2),
  (10, 1),
  (11, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1)],
 [(9, 1), (54, 1), (55, 1)],
 [(8, 1), (29, 1), (47, 1), (50, 1), (56, 2), (57, 1), (58, 1), (59, 1)],
 [(8, 1),
  (43, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1)],
 [(2, 2), (40, 1), (67, 1), (68, 1), (69, 1), (70, 2), (71, 1), (72, 1)],
 [(8, 1),
  (40, 1),
  (43, 1),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 1),
 

In [99]:
from datetime import datetime

# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 

'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
# TODO
time1 = datetime.now()
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=2,
                                       id2word=dictionary, 
                                       passes=50,
                                       minimum_probability=0.02,
                                       random_state=41)
delta_time = datetime.now() - time1
print("LDA took {}s to run".format(delta_time))

LDA took 0:01:10.748012s to run


In [100]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(num_topics=-1, num_words=5):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.034*"levi" + 0.029*"size" + 0.028*"pair" + 0.027*"wear" + 0.022*"jean"


Topic: 1 
Words: 0.042*"levi" + 0.041*"jean" + 0.021*"pair" + 0.021*"button" + 0.014*"buy"




In [101]:
# bow_corpus[documents[documents['index']==document_num].index.values[0]]

In [102]:
# print(sample_doc_topics)

In [103]:
# [dictionary[bow_doc_4310[i][0]] for i in range(len(bow_doc_4310))]

In [104]:
# documents.iloc[3]

In [105]:
# documents[documents['index']==document_num]

In [106]:
# Get the topic distribution for a single document (position 3 in the corpus)
doc_topics = lda_model.get_document_topics(bow=bow_corpus[3], minimum_probability=None, minimum_phi_value=None, per_word_topics=False)

In [107]:
bow_corpus[0]

[(0, 1),
 (1, 2),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 4),
 (9, 1),
 (10, 2),
 (11, 2),
 (12, 2),
 (13, 1),
 (14, 1),
 (15, 1),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 1),
 (21, 1)]

In [108]:
# Inspect the second document in the corpus (position 1)
bow_doc_1 = bow_corpus[1]

for i in range(len(bow_doc_1)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_1[i][0], 
                                                     dictionary[bow_doc_1[i][0]], 
                                                     bow_doc_1[i][1]))

Word 12 ("order") appears 1 time.
Word 22 ("advertis") appears 1 time.
Word 23 ("color") appears 1 time.
Word 24 ("continu") appears 1 time.
Word 25 ("differ") appears 1 time.
Word 26 ("drastic") appears 1 time.
Word 27 ("hideous") appears 1 time.
Word 28 ("result") appears 1 time.
Word 29 ("shrink") appears 2 time.
Word 30 ("stay") appears 1 time.
Word 31 ("true") appears 1 time.
Word 32 ("wrinkl") appears 2 time.


In [109]:
documents.head()

Unnamed: 0,index,reviewText
0,1,"If I had wanted blue burlap sack, I would have..."
1,2,"Although advertised as pre-shrunk, these conti..."
2,7,These pants seem to be a bad copy of Levi's\nB...
3,12,Belt loop broke off after wearing once!!!!
4,17,Don't like the way this material looks ! I fee...


In [110]:
documents.iloc[2]['reviewText']

"These pants seem to be a bad copy of Levi's\nBad seams\nEstos pantalones parecen ser una mala copia de Levi's\nMal acabados en general, malas costuras"

In [111]:
sample_doc_topics, topics_per_word, phi = lda_model.get_document_topics(bow=bow_corpus[2], minimum_probability=None, minimum_phi_value=None, per_word_topics=True)

In [112]:
sample_doc_topics

[(0, 0.9569577), (1, 0.043042302)]

In [113]:
for word_idx, topic_idx in topics_per_word:
    print("{} belongs to topic {}".format(dictionary[word_idx], topic_idx))

levi belongs to topic [0, 1]
acabado belongs to topic [0]
copi belongs to topic [0]
copia belongs to topic [0]
costura belongs to topic [0]
esto belongs to topic [0]
general belongs to topic [0]
mala belongs to topic [0]
pant belongs to topic [0]
pantalon belongs to topic [0]
parecen belongs to topic [0]
seam belongs to topic [0, 1]


In [115]:
lda_model.print_topic(1)

'0.042*"levi" + 0.041*"jean" + 0.021*"pair" + 0.021*"button" + 0.014*"buy" + 0.013*"like" + 0.012*"amazon" + 0.011*"materi" + 0.010*"wear" + 0.010*"product"'

In [116]:
lda_model.print_topic(0, topn=50)

'0.034*"levi" + 0.029*"size" + 0.028*"pair" + 0.027*"wear" + 0.022*"jean" + 0.017*"year" + 0.015*"qualiti" + 0.014*"return" + 0.013*"like" + 0.013*"pant" + 0.012*"order" + 0.012*"buy" + 0.011*"wash" + 0.010*"small" + 0.010*"waist" + 0.010*"time" + 0.007*"belt" + 0.007*"inch" + 0.007*"shrink" + 0.006*"differ" + 0.006*"purchas" + 0.006*"loop" + 0.006*"tight" + 0.005*"disappoint" + 0.005*"long" + 0.005*"go" + 0.005*"materi" + 0.005*"good" + 0.004*"look" + 0.004*"smaller" + 0.004*"fabric" + 0.004*"think" + 0.004*"leg" + 0.004*"come" + 0.004*"price" + 0.004*"brand" + 0.004*"month" + 0.004*"wast" + 0.004*"break" + 0.004*"tri" + 0.003*"product" + 0.003*"denim" + 0.003*"right" + 0.003*"larg" + 0.003*"color" + 0.003*"horribl" + 0.003*"fit" + 0.003*"money" + 0.003*"length" + 0.003*"work"'

In [117]:
from operator import itemgetter
max(doc_topics,key=itemgetter(1))

(0, 0.8939507)

In [118]:
reviews2topics = {}
for i in range(len(bow_corpus)):    
    review_topic = lda_model.get_document_topics(bow=bow_corpus[i], minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
    reviews2topics[i] = max(review_topic,key=itemgetter(1))[0]
# print(reviews2topics)

In [153]:
len(reviews2topics)

1680
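
Assigning each review to its single most probable topic hides how confident that assignment is. The sketch below shows the same argmax step on made-up distributions shaped like the output of `get_document_topics()`, and additionally flags reviews whose top topic is not clearly dominant. The `dominant_topic` helper and the 0.6 threshold are arbitrary choices for illustration.

```python
from collections import Counter
from operator import itemgetter

def dominant_topic(topic_probs):
    """Return (topic_id, probability) for the most probable topic."""
    return max(topic_probs, key=itemgetter(1))

# Made-up distributions in the shape returned by get_document_topics().
example_doc_topics = {
    0: [(0, 0.11), (1, 0.89)],
    1: [(0, 0.96), (1, 0.04)],
    2: [(0, 0.52), (1, 0.48)],
}

assignments = {doc: dominant_topic(tp)[0] for doc, tp in example_doc_topics.items()}
topic_sizes = Counter(assignments.values())
print(topic_sizes)

# Flag documents whose top topic is not clearly dominant (threshold is arbitrary).
ambiguous = [doc for doc, tp in example_doc_topics.items() if dominant_topic(tp)[1] < 0.6]
print(ambiguous)
```

Counting assignments per topic is a quick sanity check: if nearly every review lands in one topic, the chosen number of topics or the preprocessing may need revisiting.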

In [119]:

for doc_index, topic in reviews2topics.items():
    if topic ==0:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')

Although advertised as pre-shrunk, these continue to shrink drastically, and the wrinkles are hideous.  Got two different colors, same results.  Ordered the 517's as well, and they stayed true to fit without the wrinkles.


These pants seem to be a bad copy of Levi's
Bad seams
Estos pantalones parecen ser una mala copia de Levi's
Mal acabados en general, malas costuras


Belt loop broke off after wearing once!!!!


Pants came a little rip and it came with a bad smell I wash them twice and smell is going away.


I love me some Levis but when you wash them and one pant leg starts twisting as it nears your ankles and the side seam is facing the front it an unpleasant fashion flaw. Can I exchange them for a new pair???


Im not happy at all, these so-called Levis are ridiculously light weight. I can see through the jeans and limited fabrics. I see the holes where the tags were punched through. Now I have to waste time returning which irritates me big time! There should be consequences for 

The size is not a regular size its smallest that what the normal size is


The crotch stitching ripped in the first month of wear. Fit nicely but didn't even get a chance to break them in.


the hem is half the size it should be, the thread is the wrong color and thickness is wrong, really unhappy....


I have worn levi 501s for years and they were great. But I Just got some new ones and they are junk the quality is terrible nothing like they use to be. Maybe if more people complained they would wake up and make them like they use to. I WILL NEVER BUY LEVIS AGAIN IF THEO KEEP MAKING THIS JUNK. THE QUALITY DOES NOT MATCH THEO HIGH PRICE


This product has permanent wrinkles in the fabric. Looks like Manufacture reject to me.  Will buy in store next time. BEWARE


I own many pairs of 501 Jeans but this pair was the first to shrink to to the point I could not wear them. I wore them out of the box and they were great. After a first wash on cold and hang dry they were unwearable. I could no

In [120]:
for doc_index, topic in reviews2topics.items():
    if topic ==1:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')

If I had wanted blue burlap sack, I would have looked for it. Levi now thinks they can flog off any material on their customer. The last time I ordered these levis I got something in stripes that looked like I was escaped from prison. Now it is burlap! I will never order anything from Levi again until they come back to the original levi material. What complete junk. This product sucks.


Don't like the way this material looks ! I feel as though it's to hard to buy these o. Line because when they get here there not like the ones I get a sears and other places!


The 29 wait was more like a 30. Not happy


Always worn levi,s. Bring it back to, united states to fit.shrinks while from shipping from the line..bring back the 529 also..


Fit great..  BUT ..  The crotch ripped down the inner leg seam , I have had the Levis for just 3 months.  Very disappointed and dissatisfied!!


The denim has failed (holes forming) in two spots in the crotch area. I have worn these maybe 10 times, washed ma


The Levis Red tab on back pocket does not spell out Levis symbol on tab like original jeans & as advertised. The red tab just has an R. Though Levis site says blank R tabs are authentic and they trademark red tab alone without Levis symbol, its still very disappointing. Otherwise jeans check out: Heavy fabric, copper rivets on pockets, uniform orange stitching, Levis stamp on inner pocket. And FYI, made in Egypt, not the most valuable Made in USA like originals. Still more valuable being made in China or Mexico.


I cannot fathom who would prefer button flys over zipper so I did not even think to check the description for it.


I ordered 505 jeans not 501s I returned both pairs very disappointed in the order I made it very plain what I wanted


Horrible mismarked size!!!!!! Also button fly and was not advertised as button fly!!!! Seems like  what's being sold are the styles that did not sell well in stores


I bought these Jeans in August and by October, they had ripped in the crotch 



I always bought Levis 501 on amazon, but this one looks FULL fake.
Last pair of these was genuine.
I noticed the tag loks like scanned without quiality and detail in the logo, the end of the pants too short stitch.
Very disappointed


Stay away if you are looking for true Levi's which are NOT made of lightweight fabric like these.  They are either fake, or Levi's has so radically changed their jeans that I will not buy again.


At first look I thought these were counterfeit because the denim was not rigid as shrink to fits are. But on closer examination, these are seconds. They fit poorly and the denim is way too thin. I'd hate to think these are first quality and Levi's are going to the dogs, but reading other reviews, it appears a lot of their jeans are pretty shoddy.


Six months ago I purchased 4 pair of Stonewashed 501 Jeans.  I thought they would last a year at least.  Well, they have not only torn and ripped in the pocket/waist area, but you can actually see through the fabric

I was very unhappy with this purchase; I will no longer purchase clothes from Amazon.  I was looking for a pair of shrink to fit jeans in my hard-to-find size.  The Amazon web page that came up advertised Levi's 501 shrink-to-fit jeans in the size, and I proceeded to select a pair.  The receipt mentioned Levi's 501s, but didn't specifically say shrink-to-fit.  I wasn't sure if it should and I trusted the receipt would actually reflect what the web page advertised.  The jeans arrived in a couple days, but were not shrink-to-fit.  The return process was relatively easy and painless, which is why I will continue to buy non-clothes items from Amazon.  However, the online resolution screen haughtily claimed that since the purchasing error was my fault, Amazon would not pay for the shipping.  There are too many unknowns in fit when purchasing clothes from potentially erroneous web sites for me to repeat this exercise.


If I had wanted blue burlap sack, I would have looked for it. Levi now t

Will never again get caught up again with the Levis (take you to the cleaners) ruination. It seems that CEO and corporate greed have taken away one of the all time great names in clothing. Goodbye Levis, and give my regards to Satan while you are there!


i want  zippers,this one i don't liked,is possible the change.when you show the sample we don't can see the zipper;thank you.


I bought 2 pairs of levis 501,s at sears two weeks ago. The pair i bought from amazon was a cheap imitation of those. You can tell there is a HUGE DIFFRENCE IN THE WEIGHT.tHE AMAZON ONES ARE CHEAP AND NOT WORTH THE 10 DOLLAR SAVINGS. PAY MORE AND BUY FROM SEARS ARE SOMEPLACE ELSE THAT SELLS REAL LEVIS


I am returning this. Material is way too thin. I had to resort to online order only because shops didn't seem to carry 29" in seam length. They seem to stock even numbers only. Also why don't you mention BUTTON FLY in bold on the title. Do they still make these in the zipper era ?


I have no idea what these a


Don't be as stupid as the Levi Corporation.  Although Levi jeans manufactured in the 1950's sell for thousands of dollars in New York, the company has relentlessly degraded the product for 40 years now, trading on their name alone.  They are actually so stupid as to try to compete with the slaves in a Chinese concentration camp by accepting products manufactured in Africa and South America.  The original rivets are gone, their "denim" fabric is the worst quality; all seams fray, especially cuffs; belt loops fall apart; and the only appearance achieved is shabby tissue.  These fools do not have the intelligence to realize that people would pay hundreds of dollars for jeans produced on the design, material, and workmanship of the 1940's. Stop living in the past.


The material is too hard, shrank after washing, do not meet the stated features, I do not recommend purchasing this bluejean, sorry I wasted my money on this product and the worst is that nobody is responsible for the poor qua

In [121]:
for doc_index, topic in reviews2topics.items():
    if topic ==2:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')

In [122]:
for doc_index, topic in reviews2topics.items():
    if topic ==3:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')

In [123]:
for doc_index, topic in reviews2topics.items():
    if topic ==4:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')
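
The cells for topics 2, 3, and 4 print nothing, which suggests that no review has one of those as its most probable topic. Counting assignments per topic would make this visible without printing every review. The sketch below uses a made-up mapping shaped like `reviews2topics`; the name `toy_reviews2topics` and its values are assumptions for illustration.

```python
from collections import Counter

# Hypothetical review-index -> topic-id mapping, shaped like reviews2topics.
toy_reviews2topics = {0: 1, 1: 0, 2: 0, 3: 0, 4: 1}

# Count how many reviews were assigned to each topic.
topic_counts = Counter(toy_reviews2topics.values())

for topic_id in range(5):
    # A Counter returns 0 for topics that never win the argmax.
    print(topic_id, topic_counts[topic_id])
```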

### Classification of the topics ###

Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: 
* 1: 
* 2: 
* 3: 
* 4: 
* 5: 
* 6: 
* 7:  
* 8: 
* 9: 

### Extract the most upvoted reviews for each topic, and display these to users

In [236]:
documents.count()

reviewText    1680
vote           260
asin          1680
overall       1680
dtype: int64

In [237]:
topic_review_df = documents.copy()
topic_review_df['topic'] = pd.Series(reviews2topics)

In [238]:
topic_review_df.head(12)

Unnamed: 0,reviewText,vote,asin,overall,topic
0,"If I had wanted blue burlap sack, I would have...",100.0,B000YXC2LI,1.0,1
1,"Although advertised as pre-shrunk, these conti...",12.0,B000YXC2LI,1.0,0
2,These pants seem to be a bad copy of Levi's\nB...,3.0,B000YXC2LI,1.0,0
3,Belt loop broke off after wearing once!!!!,,B000YXC2LI,1.0,0
4,Don't like the way this material looks ! I fee...,,B000YXC2LI,1.0,1
5,The 29 wait was more like a 30. Not happy,,B000YXC2LI,1.0,1
6,"Always worn levi,s. Bring it back to, united s...",,B000YXC2LI,1.0,1
7,Fit great.. BUT .. The crotch ripped down th...,,B000YXC2LI,1.0,1
8,Pants came a little rip and it came with a bad...,,B000YXC2LI,1.0,0
9,I love me some Levis but when you wash them an...,,B000YXC2LI,1.0,0


In [239]:
topic_review_df[topic_review_df['topic']==0]

Unnamed: 0,reviewText,vote,asin,overall,topic
1,"Although advertised as pre-shrunk, these conti...",12,B000YXC2LI,1.0,0
2,These pants seem to be a bad copy of Levi's\nB...,3,B000YXC2LI,1.0,0
3,Belt loop broke off after wearing once!!!!,,B000YXC2LI,1.0,0
8,Pants came a little rip and it came with a bad...,,B000YXC2LI,1.0,0
9,I love me some Levis but when you wash them an...,,B000YXC2LI,1.0,0
...,...,...,...,...,...
1666,I had to return them. Completely different fro...,10,B000YXC2LI,1.0,0
1669,I have been wearing Levi 501 shrink-to-fit for...,560,B000YXC2LI,1.0,0
1670,I have been buying the sized 501's (not the Sh...,566,B000YXC2LI,1.0,0
1671,"<a data-hook=""product-link-linked"" class=""a-li...",365,B000YXC2LI,1.0,0


#### There are duplicate reviews for this product

In [253]:
topic_review_df[topic_review_df['topic']==0].sort_values(by=['vote'], ascending=False).drop_duplicates(['reviewText', 'overall', 'vote']).head(10)

Unnamed: 0,reviewText,vote,asin,overall,topic
1305,"It teared apart in two weeks, it was only used...",9,B000YXC2LI,1.0,0
369,For a person to have a 38 inch waist and be ab...,8,B000YXC2LI,1.0,0
724,I ordered 3 pairs of pants at the same time fr...,7,B000YXC2LI,1.0,0
830,I have been buying the sized 501's (not the Sh...,566,B000YXC2LI,1.0,0
1669,I have been wearing Levi 501 shrink-to-fit for...,560,B000YXC2LI,1.0,0
802,I've been wearing 501s since I was a kid - abo...,55,B000YXC2LI,1.0,0
1174,"Very disappointed. Over the last 3-4 years, I ...",5,B000YXC2LI,1.0,0
820,"<div id=""video-block-R1465PFSIOTTT0"" class=""a-...",5,B000YXC2LI,1.0,0
166,I've purchased two pairs of 501's in the same ...,4,B000YXC2LI,1.0,0
726,I've worn the orig shrink to fit for years & h...,4,B000YXC2LI,1.0,0


In [254]:
topic_review_df[topic_review_df['topic']==1].sort_values(by=['vote'], ascending=False).drop_duplicates(['reviewText', 'overall', 'vote']).head(10)

Unnamed: 0,reviewText,vote,asin,overall,topic
495,These Levis 501 jeans have to be fake. They h...,92,B000YXC2LI,1.0,1
1663,These pants are the worst Levis I have ever ow...,9,B000YXC2LI,1.0,1
1662,Item missing a belt loop. Inferior quality to...,9,B000YXC2LI,1.0,1
1677,These Jeans are made with a thin worn out deni...,8,B000YXC2LI,1.0,1
1645,"Can't believe it, but I guess times change and...",680,B000YXC2LI,1.0,1
1672,Real 501's are made of 14 oz canvas-like mater...,644,B000YXC2LI,1.0,1
1413,So just got my order in today and guess what. ...,62,B000YXC2LI,1.0,1
1458,Remember when you went in to a store to buy yo...,6,B000YXC2LI,1.0,1
1679,I was very unhappy with this purchase; I will ...,6,B000YXC2LI,1.0,1
1080,"Have ordered jeans from Amazon before, never a...",5,B000YXC2LI,1.0,1
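
In the two tables above the `vote` column is ordered 9, 8, 7, 566, 560, 55, ..., which is what a string sort produces: the values are compared character by character rather than as numbers. A minimal sketch of converting the raw values to integers before sorting is shown below. The `parse_vote` helper is hypothetical, and the handling of missing values and thousands separators is an assumption about how the raw data may look.

```python
import math

def parse_vote(vote):
    """Convert a raw vote value such as '12', '1,234', 12.0, None, or NaN to an int."""
    if vote is None:
        return 0
    if isinstance(vote, float):
        # Missing votes show up as NaN in the DataFrame; treat them as zero.
        return 0 if math.isnan(vote) else int(vote)
    if isinstance(vote, int):
        return vote
    text = str(vote).replace(',', '').strip()
    return int(text) if text else 0

raw_votes = ['9', '566', '55', None, '1,234', float('nan')]

# String sort: '9' comes before '566' because '9' > '5'.
string_sorted = sorted(['9', '566', '55'], reverse=True)
print(string_sorted)

# Numeric sort: compare the parsed integers instead.
numeric_sorted = sorted(raw_votes, key=parse_vote, reverse=True)
print([parse_vote(v) for v in numeric_sorted])
```

In the notebook, the same helper could be applied with `topic_review_df['vote'].map(parse_vote)` before calling `sort_values`, so that the most upvoted reviews actually come first.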


## Step 4.2 Running LDA using TF-IDF ##

In [124]:
'''
Define the LDA model on corpus_tfidf, again using gensim.models.LdaMulticore()
'''
time2 = datetime.now()
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
                                       num_topics=3,
                                       id2word=dictionary, 
                                       passes=50)
delta_time2 = datetime.now() - time2
print("LDA took {}s to run".format(delta_time2))

LDA took 0:01:00.897449s to run


In [125]:
'''
For each topic, explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.009*"wash" + 0.008*"color" + 0.008*"jean" + 0.007*"button" + 0.007*"zipper" + 0.007*"pair" + 0.006*"want" + 0.006*"levi" + 0.006*"order" + 0.006*"differ"


Topic: 1 Word: 0.010*"qualiti" + 0.010*"levi" + 0.009*"size" + 0.008*"wear" + 0.008*"like" + 0.008*"pair" + 0.008*"year" + 0.008*"jean" + 0.007*"expect" + 0.007*"pant"


Topic: 2 Word: 0.029*"small" + 0.014*"return" + 0.012*"button" + 0.010*"jean" + 0.010*"size" + 0.008*"tight" + 0.008*"pair" + 0.007*"waist" + 0.007*"long" + 0.006*"levi"




### Classification of the topics ###

As we can see, TF-IDF down-weights words that appear in many reviews and gives relatively more weight to less frequent words, so rarer terms, often specific nouns, surface in the topics. That makes the categories harder to label, because the top words per topic no longer reflect the most common themes and all carry similarly small weights. This goes to show that the right model depends on the corpus of text we are dealing with.
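
To make the weighting concrete, here is a minimal sketch of a TF-IDF computation on three made-up token lists. It uses term count times `log2(N / document frequency)`, which is one common formulation; the exact formula and normalization used by `gensim.models.TfidfModel` may differ, and the `toy_docs` and `tfidf` names are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical preprocessed reviews (already tokenized and stemmed).
toy_docs = [
    ["levi", "jean", "shrink"],
    ["levi", "jean", "button"],
    ["levi", "zipper"],
]

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))
n_docs = len(toy_docs)

def tfidf(doc):
    """Weight each term by its count times log2(N / document frequency)."""
    counts = Counter(doc)
    return {term: count * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()}

weights = tfidf(toy_docs[0])
print(weights)
```

In this toy corpus "levi" appears in every document and ends up with weight 0, while "shrink" appears in only one and gets the largest weight. The same effect explains why brand words that dominate the bag-of-words topics carry much less weight here.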

Using the words in each topic and their corresponding weights, what categories could you find?

* 0: 
* 1: 
* 2: 

In [126]:
# Classify all docs into their topics
reviews2topics_tfidf = {}
for i in range(len(corpus_tfidf)):    
    review_topic_tfidf = lda_model_tfidf.get_document_topics(bow=corpus_tfidf[i], minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
    reviews2topics_tfidf[i] = max(review_topic_tfidf,key=itemgetter(1))[0]
# print(reviews2topics_tfidf)

In [127]:
for doc_index, topic in reviews2topics_tfidf.items():
    if topic ==0:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')

If I had wanted blue burlap sack, I would have looked for it. Levi now thinks they can flog off any material on their customer. The last time I ordered these levis I got something in stripes that looked like I was escaped from prison. Now it is burlap! I will never order anything from Levi again until they come back to the original levi material. What complete junk. This product sucks.


Although advertised as pre-shrunk, these continue to shrink drastically, and the wrinkles are hideous.  Got two different colors, same results.  Ordered the 517's as well, and they stayed true to fit without the wrinkles.


Don't like the way this material looks ! I feel as though it's to hard to buy these o. Line because when they get here there not like the ones I get a sears and other places!


The 29 wait was more like a 30. Not happy


Chinese crap.  NOTHING like the quality you expect from Levis.

Inside waist hem is raised and VERY rough (literally:  SHARP) and injures skin if you don't have a s

In [128]:
# Print at most 20 reviews assigned to topic 1
counter = 0
for doc_index, topic in reviews2topics_tfidf.items():
    if topic == 1 and counter < 20:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')
        counter += 1

These pants seem to be a bad copy of Levi's
Bad seams
Estos pantalones parecen ser una mala copia de Levi's
Mal acabados en general, malas costuras


Belt loop broke off after wearing once!!!!


Pants came a little rip and it came with a bad smell I wash them twice and smell is going away.


Im not happy at all, these so-called Levis are ridiculously light weight. I can see through the jeans and limited fabrics. I see the holes where the tags were punched through. Now I have to waste time returning which irritates me big time! There should be consequences for this type of nonsense for the crooks that allow this trying to ripoff customers. How about just giving me the item I chose & paid for!


These pants seemed different then most Levis. Not true to size .


The back belt loop broke out of the bottom stitching after only a few times wearing the pants.  I won't buy this product again.


I purchased 2 pairs of size 30 x 30 501's on amazon within the past few months.  The "rinse" fits no

### TF-IDF LDA with different parameters

In [129]:
'''
Define a second LDA model on corpus_tfidf with more topics and more passes,
again using gensim.models.LdaMulticore()
'''
time3 = datetime.now()
lda_model_tfidf_2 = gensim.models.LdaMulticore(corpus_tfidf,
                                       num_topics=8,
                                       id2word=dictionary, 
                                       passes=150)
delta_time3 = datetime.now() - time3
print("LDA took {}s to run".format(delta_time3))

LDA took 0:02:20.260503s to run


In [130]:
'''
For each topic, explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf_2.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.011*"size" + 0.010*"levi" + 0.010*"jean" + 0.009*"wear" + 0.009*"buy" + 0.008*"shrink" + 0.008*"pant" + 0.008*"color" + 0.007*"pair" + 0.007*"time"


Topic: 1 Word: 0.012*"poor" + 0.010*"return" + 0.010*"pant" + 0.009*"longer" + 0.009*"levi" + 0.008*"qualiti" + 0.008*"jean" + 0.007*"fabric" + 0.007*"waist" + 0.007*"wash"


Topic: 2 Word: 0.021*"button" + 0.013*"return" + 0.012*"zipper" + 0.009*"jean" + 0.009*"fake" + 0.008*"skinni" + 0.008*"instead" + 0.008*"wrong" + 0.008*"size" + 0.008*"order"


Topic: 3 Word: 0.010*"bigger" + 0.007*"small" + 0.006*"pant" + 0.006*"give" + 0.005*"christma" + 0.005*"return" + 0.005*"jean" + 0.005*"size" + 0.005*"send" + 0.005*"horribl"


Topic: 4 Word: 0.018*"expect" + 0.011*"send" + 0.009*"want" + 0.009*"like" + 0.008*"stitch" + 0.008*"zipper" + 0.007*"pair" + 0.007*"chang" + 0.007*"button" + 0.006*"jean"


Topic: 5 Word: 0.075*"small" + 0.015*"larg" + 0.011*"good" + 0.010*"size" + 0.009*"right" + 0.008*"return" + 0.006*"reject" + 0.0

##### Topic classification

In [131]:
# Classify all docs into their topics
reviews2topics_tfidf_2 = {}
for i in range(len(corpus_tfidf)):    
    review_topic_tfidf_2 = lda_model_tfidf_2.get_document_topics(bow=corpus_tfidf[i], minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
    reviews2topics_tfidf_2[i] = max(review_topic_tfidf_2,key=itemgetter(1))[0]
# print(reviews2topics_tfidf_2)

In [132]:
for doc_index, topic in reviews2topics_tfidf_2.items():
    if topic ==0:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')

Although advertised as pre-shrunk, these continue to shrink drastically, and the wrinkles are hideous.  Got two different colors, same results.  Ordered the 517's as well, and they stayed true to fit without the wrinkles.


Always worn levi,s. Bring it back to, united states to fit.shrinks while from shipping from the line..bring back the 529 also..


Fit great..  BUT ..  The crotch ripped down the inner leg seam , I have had the Levis for just 3 months.  Very disappointed and dissatisfied!!


Chinese crap.  NOTHING like the quality you expect from Levis.

Inside waist hem is raised and VERY rough (literally:  SHARP) and injures skin if you don't have a shirt tucked in (and even then...).

Belt loops broke about the third time I wore them.


What is going on with sizing here?  I ordered 32x34 and the jeans I received are comically short.  Haven't washed them or anything.  I've bought so many pairs of pants in this size, including from Levi's, and never seen anything like this.  I compa

In [133]:
for doc_index, topic in reviews2topics_tfidf_2.items():
    if topic ==1:
        print(documents.iloc[doc_index]['reviewText'])
        print('\n')

If I had wanted blue burlap sack, I would have looked for it. Levi now thinks they can flog off any material on their customer. The last time I ordered these levis I got something in stripes that looked like I was escaped from prison. Now it is burlap! I will never order anything from Levi again until they come back to the original levi material. What complete junk. This product sucks.


The 29 wait was more like a 30. Not happy


Pants came a little rip and it came with a bad smell I wash them twice and smell is going away.


I love me some Levis but when you wash them and one pant leg starts twisting as it nears your ankles and the side seam is facing the front it an unpleasant fashion flaw. Can I exchange them for a new pair???


I bought 2 pairs of jeans and returned them both in the same box.  Only one pair was recieved back???


36x36 is actually 39x37.  way too big. they don't shrink at all when washed too. what a waste. will try to return


I've purchased hundreds of this style

## Step 5.1: Performance evaluation by classifying a sample document using the LDA Bag of Words model ##

We will check to see where our test document would be classified. 

In [135]:
'''
Preprocessed tokens of sample document 1
'''
processed_docs[1]

['advertis',
 'shrink',
 'continu',
 'shrink',
 'drastic',
 'wrinkl',
 'hideous',
 'differ',
 'color',
 'result',
 'order',
 'stay',
 'true',
 'wrinkl']

In [137]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''
# Our test document is document number 1
document_num = 1

for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.9553169012069702	 
Topic: 0.034*"levi" + 0.029*"size" + 0.028*"pair" + 0.027*"wear" + 0.022*"jean" + 0.017*"year" + 0.015*"qualiti" + 0.014*"return" + 0.013*"like" + 0.013*"pant"

Score: 0.0446830652654171	 
Topic: 0.042*"levi" + 0.041*"jean" + 0.021*"pair" + 0.021*"button" + 0.014*"buy" + 0.013*"like" + 0.012*"amazon" + 0.011*"materi" + 0.010*"wear" + 0.010*"product"


### The test document has the highest probability (about 95.5%) of belonging to topic 0, which matches the topic assigned to this review in `topic_review_df`. ###

## Step 5.2: Performance evaluation by classifying a sample document using the LDA TF-IDF model ##

In [138]:
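'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
# Our test document is document number 1 (document_num).
# Note: this scores the bag-of-words vector; pass corpus_tfidf[document_num]
# instead to score the TF-IDF-weighted vector the model was trained on.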
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.9449763894081116	 
Topic: 0.009*"wash" + 0.008*"color" + 0.008*"jean" + 0.007*"button" + 0.007*"zipper" + 0.007*"pair" + 0.006*"want" + 0.006*"levi" + 0.006*"order" + 0.006*"differ"

Score: 0.03146381303668022	 
Topic: 0.010*"qualiti" + 0.010*"levi" + 0.009*"size" + 0.008*"wear" + 0.008*"like" + 0.008*"pair" + 0.008*"year" + 0.008*"jean" + 0.007*"expect" + 0.007*"pant"

Score: 0.023559778928756714	 
Topic: 0.029*"small" + 0.014*"return" + 0.012*"button" + 0.010*"jean" + 0.010*"size" + 0.008*"tight" + 0.008*"pair" + 0.007*"waist" + 0.007*"long" + 0.006*"levi"


### The test document has the highest probability (about 94.5%) of belonging to topic 0 of the TF-IDF model. ###

## Step 6: Testing model on unseen document ##

In [139]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.7896367311477661	 Topic: 0.034*"levi" + 0.029*"size" + 0.028*"pair" + 0.027*"wear" + 0.022*"jean"
Score: 0.21036328375339508	 Topic: 0.042*"levi" + 0.041*"jean" + 0.021*"pair" + 0.021*"button" + 0.014*"buy"


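The model assigns the unseen document to topic 0 with about 79% probability. Note, however, that the sentence is about sports while the model was trained only on jeans reviews, so this score says little about what the document is actually about and should not be read as a correct classification.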
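
One way to sanity-check a prediction on unseen text is to measure how much of the document the model can actually see, since `dictionary.doc2bow` only counts tokens that are already in the dictionary. The sketch below is a rough illustration on made-up data: `vocabulary_coverage`, the stemmed tokens, and the tiny vocabulary are all assumptions, not values taken from the notebook.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Return the fraction of tokens that appear in the training vocabulary."""
    if not tokens:
        return 0.0
    known = [tok for tok in tokens if tok in vocabulary]
    return len(known) / len(tokens)

# Hypothetical stemmed tokens for the unseen sentence, and a tiny stand-in
# for the vocabulary learned from the jeans reviews.
unseen_tokens = ["favorit", "sport", "activ", "run", "swim"]
toy_vocabulary = {"levi", "jean", "pair", "wear", "run"}

coverage = vocabulary_coverage(unseen_tokens, toy_vocabulary)
print(coverage)
```

A low coverage value means the topic scores rest on very few words, so they are better treated as a default guess than as evidence about the document's subject.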