# Baseline Modeling #

We use an unsupervised approach, Latent Dirichlet Allocation (LDA), to build a topic model of the review text.
This notebook covers the following steps:
1. Preprocess the review data
2. Construct a bag-of-words vector representation of the preprocessed data
3. Construct multiple initial LDA topic models using various sample sizes from the data

In [102]:
# Imports
import gensim
import nltk
import numpy as np
import pandas as pd
import pyLDAvis.gensim
import scipy
import string

from gensim.models.ldamulticore import LdaMulticore
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

In [101]:
# Import data as DataFrame
r_df = pd.read_csv('/Users/dwalkerpage/Documents/Data_Science/Springboard/Projects/springboard/Capstone_Projects/Capstone_Project_2/Code/cleaned_restaurants_final.csv')

## 1. Data Preprocessing ##

**Steps:**
* Remove punctuation
* Convert words to lowercase
* Remove stopwords
* Lemmatize words
    * As noted [here](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#sphx-glr-auto-examples-tutorials-run-lda-py), we use lemmatization and not stemming because lemmatization produces more readable words, and readability is a desirable property for the output in topic modelling.
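
The first three steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `clean_text` and the tiny `DEMO_STOP_WORDS` set are hypothetical names introduced here, whereas the notebook uses NLTK's English stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook itself uses NLTK's English list.
DEMO_STOP_WORDS = {"the", "a", "was", "it", "and", "to", "of"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower().translate(_PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in stop_words)


cleaned = clean_text("The soup was great, and the service wasn't slow!")
```

Note the side effect of stripping punctuation before removing stop words: a contraction such as "wasn't" becomes `wasnt`, which a stop-word list keyed on the apostrophe form will not match. This is likely why tokens such as `wouldnt` and `dont` survive in the processed reviews and topics further down.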

In [84]:
# Define function to remove punctuation from a string
# From here: https://stackoverflow.com/questions/33047818/remove-punctuation-for-each-row-in-a-pandas-data-frame?noredirect=1&lq=1

# Build the punctuation set once instead of once per character
PUNCTUATION = set(string.punctuation)

def remove_punctuation(s):
    '''Removes ASCII punctuation from a string'''
    return ''.join(ch for ch in s if ch not in PUNCTUATION)

In [85]:
stop_words = set(stopwords.words('english'))

In [86]:
# Define function to remove stop words, using NLTK's list of stop words

def remove_stopwords(s):
    '''Removes stopwords from a string'''
    s = ' '.join([word for word in s.split() if word not in stop_words])
    return s

In [87]:
def wordnet_tag(nltk_tag):
    '''Return relevant tag for WordNetLemmatizer'''
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:                    
        return None

In [88]:
def lemmatize(s):
    '''Lemmatizes words in a string, using POS tags to pick each lemma'''
    lemmatizer = WordNetLemmatizer()
    # Tag each token with its Penn Treebank part of speech
    tagged_words = nltk.pos_tag(word_tokenize(s))
    lemmatized_words = []
    for word, nltk_tag in tagged_words:
        tag = wordnet_tag(nltk_tag)
        if tag is None:
            # No matching WordNet POS (e.g. determiners): keep the word as-is
            lemmatized_words.append(word)
        else:
            lemmatized_words.append(lemmatizer.lemmatize(word, tag))
    return ' '.join(lemmatized_words)

Examples of lemmatized sentences:

In [89]:
lemmatize("John's big idea isn't all that bad.")

"John 's big idea be n't all that bad ."

In [90]:
lemmatize("The striped bats are hanging on their feet for best")

'The striped bat be hang on their foot for best'

Compare a review from the data set with its lemmatized counterpart:

Original Review:

In [91]:
r_df.text.loc[1]

"We've always been there on a Sunday so we were hoping that Saturday dim sum would be less busy. No such luck. We were surprised that some of the dishes were cold because it was so packed; I could understand if it was empty and the carts weren't circulating but every table was full. It took a while to get drinks and other items (napkins). The dishes were not of the same quality as they had been on other visits, but they were acceptable."

Lemmatized Version:

In [92]:
lemmatize(r_df.text.loc[1])

"We 've always be there on a Sunday so we be hop that Saturday dim sum would be less busy . No such luck . We be surprise that some of the dish be cold because it be so packed ; I could understand if it be empty and the cart be n't circulate but every table be full . It take a while to get drink and other item ( napkin ) . The dish be not of the same quality as they have be on other visit , but they be acceptable ."

In [94]:
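def preprocess(df):
    '''Lowercases, strips punctuation, removes stopwords, and lemmatizes df['text'].
    Note: modifies df in place and returns it.'''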
    df['text'] = df['text'].str.lower()
    tqdm.pandas(desc='remove_punctuation')
    df['text'] = df['text'].progress_apply(remove_punctuation)
    tqdm.pandas(desc='remove_stopwords')
    df['text'] = df['text'].progress_apply(remove_stopwords)
    tqdm.pandas(desc='lemmatize')
    df['text'] = df['text'].progress_apply(lemmatize)
    return df

In [96]:
# Apply preprocessing to a copy so that r_df keeps the original review text
proc_r_df = preprocess(r_df.copy())


Let's compare some (arbitrarily chosen) original reviews with their respective processed versions:

Original Review:

In [102]:
r_df.text.loc[23]

"I would say Emeralds has worsen over a period of few months, honestly, the service here was never the greatest, but now it has gotten to the point where the employees barely know any english! \n\nAlso, they served us with dumplings that have gone bad, and were very sour in taste.  Meh, I wouldn't really care however, they charged us for it?!? So I have to pay to eat dumplings that may prove to be a health concern?  \n\n...At least the washrooms were clean... :P"

Processed Version:

In [100]:
proc_r_df.text.loc[23]

'would say emerald worsen period month honestly service never great gotten point employee barely know english also serve us dumplings go bad sour taste meh wouldnt really care however charge us pay eat dumpling may prove health concern least washrooms clean p'

Original Review:

In [104]:
r_df.text.loc[579]

'After wanting to try Ravi soups out for months it did not disappoint. The curried lentil and apricot soup was the clear winner of everything we tried. We also tried the veggie wrap, pork wrap, and corn bisque.'

Processed Version:

In [105]:
proc_r_df.text.loc[579]

'want try ravi soups month disappoint curry lentil apricot soup clear winner everything try also try veggie wrap pork wrap corn bisque'

In [8]:
# Check for null values
pd.isna(proc_r_df['text']).sum()

1

In [9]:
# identify null value
proc_r_df[proc_r_df.text.isnull()]

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,...,stars_x,state,cool,date,funny,review_id,stars_y,text,useful,user_id
2023206,4000 W Flamingo Rd,"{'BusinessParking': ""{'garage': True, 'street'...",UXFLCGw1yOCajlMONYQM1w,"Buffets, Restaurants, Chinese",Las Vegas,"{'Monday': '16:0-21:0', 'Tuesday': '16:0-21:0'...",1.0,36.116635,-115.192905,Ports O' Call Buffet,...,3.0,NV,0.0,2015-02-14 14:10:24,0.0,ZY2DqyaJBzW3pUa9F0N5-Q,1.0,,1.0,yctYv9E1rK6AYTx1FHU3gw


In [10]:
# remove null value
proc_r_df = proc_r_df.loc[~pd.isna(proc_r_df['text'])]

In [24]:
# Check for null values
pd.isna(proc_r_df['text']).sum()

0

In [26]:
# Save processed data to csv file
proc_r_df.to_csv('processed_cleaned_restaurants_final.csv', index=False)

## 2. Vectorize Documents ##

In [2]:
# Import data as DataFrame
proc_r_df = pd.read_csv('/Users/dwalkerpage/Documents/Data_Science/Springboard/Projects/springboard/Capstone_Projects/Capstone_Project_2/Code/processed_cleaned_restaurants_final.csv')

In [8]:
proc_r_df.shape[0]/10

416746.0

To save time and computation, we work with samples of several sizes (1,000 to 500,000 reviews) rather than the full dataset of roughly 4.17 million reviews. Checking how well the results carry over to the full dataset is left for future work.
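
The key property we want from the sampling step is reproducibility: the same seed should always select the same reviews, so that models trained at different times are comparable. A minimal standard-library sketch of that idea follows; `sample_reviews` is a hypothetical helper, and the notebook itself relies on `DataFrame.sample` with a fixed `random_state`.

```python
import random


def sample_reviews(reviews, sample_size, seed=7):
    """Draw sample_size reviews without replacement, reproducibly."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return rng.sample(reviews, sample_size)


# Stand-in corpus; in the notebook this would be the processed review text.
reviews = [f"review {i}" for i in range(10_000)]
small = sample_reviews(reviews, 100)
again = sample_reviews(reviews, 100)
```

Calling the helper twice with the same seed returns the same reviews in the same order, while a different seed gives an independent draw.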

In [48]:
# Function for getting sample
def get_sample(df, sample_size=1000, random_state=7):
    return df.sample(n=sample_size, random_state=random_state)

In [49]:
# Function for tokenizing strings in an iterable
def get_tokens(iterable):
    # Use `doc` rather than `string` so the imported `string` module is not shadowed
    return [doc.split() for doc in tqdm(iterable, desc='get_tokens')]

In [51]:
# Function to build a gensim Dictionary, which maps each unique token in the corpus to an integer id
def get_feature_dict(word_tokens):
    return gensim.corpora.Dictionary(word_tokens)

In [52]:
# Function to construct Bag of Words representation of text
def get_bow(feature_dict, word_tokens):
    # Each document becomes a list of (token_id, count) pairs
    return [feature_dict.doc2bow(doc_tokens) for doc_tokens in tqdm(word_tokens, desc='get_bow')]
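
To make the bag-of-words representation concrete, here is a small standard-library sketch of what a token-to-id dictionary and a `(token_id, count)` encoding look like. `build_vocab` and `to_bow` are hypothetical helpers that only mimic the shape of the gensim output; the ids gensim assigns may differ.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


docs = [["good", "food", "good"], ["food", "service"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
```

Word order is discarded: only which tokens occur in a document, and how often, reaches the LDA model.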

In [57]:
# Function to get feature_id to feature mapping (needed to inspect topics later)
def get_id2token(feature_dict):
    _ = feature_dict[0]  # id2token is built lazily; indexing once populates it
    return feature_dict.id2token

In [62]:
# Function to get sample, vectorize text data, and split vector representation into train/test sets
def lda_input(df, sample_size=1000, test_size=0.3, random_state=7):
    
    df_sample = get_sample(df, sample_size=sample_size, random_state=random_state)
    
    tokens = get_tokens(df_sample['text'])
    
    feature_dict = get_feature_dict(tokens)
    
    bow = get_bow(feature_dict, tokens)
    
    id2token_mapping = get_id2token(feature_dict)
    
    xtrain, xtest = train_test_split(bow, test_size=test_size, random_state=random_state)
    
    return df_sample, tokens, feature_dict, bow, id2token_mapping, xtrain, xtest

In [70]:
# SAMPLE SIZE 1_000

(proc_r_df_sample1_000,
 review_tokens1_000,
 feature_dict1_000,
 bow1_000,
 id2word1_000,
 xtrain1_000,
 xtest1_000) = lda_input(proc_r_df,
                         sample_size=1_000,
                         test_size=0.3,
                         random_state=7
                         )

In [72]:
# SAMPLE SIZE 10_000

(proc_r_df_sample10_000,
 review_tokens10_000,
 feature_dict10_000,
 bow10_000,
 id2word10_000,
 xtrain10_000,
 xtest10_000) = lda_input(proc_r_df,
                         sample_size=10_000,
                         test_size=0.3,
                         random_state=7
                         )

In [73]:
# SAMPLE SIZE 100_000

(proc_r_df_sample100_000,
 review_tokens100_000,
 feature_dict100_000,
 bow100_000,
 id2word100_000,
 xtrain100_000,
 xtest100_000) = lda_input(proc_r_df,
                         sample_size=100_000,
                         test_size=0.3,
                         random_state=7
                         )

In [92]:
# SAMPLE SIZE 500_000

(proc_r_df_sample500_000,
 review_tokens500_000,
 feature_dict500_000,
 bow500_000,
 id2word500_000,
 xtrain500_000,
 xtest500_000) = lda_input(proc_r_df,
                         sample_size=500_000,
                         test_size=0.3,
                         random_state=7
                         )

## 3. LDA Topic Models ##

In [61]:
help(LdaMulticore)

Help on class LdaMulticore in module gensim.models.ldamulticore:

class LdaMulticore(gensim.models.ldamodel.LdaModel)
 |  LdaMulticore(corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None, minimum_probability=0.01, minimum_phi_value=0.01, per_word_topics=False, dtype=<class 'numpy.float32'>)
 |  
 |  An optimized implementation of the LDA algorithm, able to harness the power of multicore CPUs.
 |  Follows the similar API as the parent class :class:`~gensim.models.ldamodel.LdaModel`.
 |  
 |  Method resolution order:
 |      LdaMulticore
 |      gensim.models.ldamodel.LdaModel
 |      gensim.interfaces.TransformationABC
 |      gensim.utils.SaveLoad
 |      gensim.models.basemodel.BaseTopicModel
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, corpus=None, num_topics=100, id2word=None, workers=

### LDA Model with Sample Size 1,000 ###

In [82]:
%%time

lda1_000 = LdaMulticore(corpus=bow1_000,
                        id2word=id2word1_000,
                        num_topics=5,
                        workers=5,
                        chunksize=2000,
                        passes=1,
                        batch=False,
                        alpha='symmetric',
                        eta=None,
                        decay=0.5,
                        offset=1.0,
                        eval_every=None,
                        iterations=50,
                        gamma_threshold=0.001,
                        minimum_probability=0.01,
                        random_state=7
#                         minimum_phi_value=0.01,
#                         per_word_topics=False,
#                         dtype=<class 'numpy.float32'>,
                    )

CPU times: user 261 ms, sys: 126 ms, total: 387 ms
Wall time: 796 ms


In [85]:
lda1_000.show_topics(formatted=False)

[(0,
  [('food', 0.012662879),
   ('place', 0.011021428),
   ('great', 0.009415503),
   ('go', 0.0091834515),
   ('good', 0.008783108),
   ('order', 0.007936001),
   ('well', 0.007903132),
   ('time', 0.0076403674),
   ('like', 0.0065482096),
   ('make', 0.0065188073)]),
 (1,
  [('food', 0.01756897),
   ('get', 0.013004785),
   ('good', 0.012911806),
   ('go', 0.010328119),
   ('place', 0.009638864),
   ('service', 0.008195439),
   ('great', 0.007493089),
   ('like', 0.0061502405),
   ('come', 0.0056369337),
   ('try', 0.004802929)]),
 (2,
  [('good', 0.015943957),
   ('place', 0.012444894),
   ('food', 0.010193529),
   ('go', 0.009639071),
   ('get', 0.00842312),
   ('order', 0.0074743675),
   ('great', 0.007362948),
   ('time', 0.006679303),
   ('one', 0.006175509),
   ('also', 0.0061264876)]),
 (3,
  [('good', 0.011204761),
   ('get', 0.011032292),
   ('food', 0.008898975),
   ('come', 0.008082554),
   ('place', 0.0079306085),
   ('time', 0.0078652715),
   ('go', 0.007531253),
   ('

In [84]:
# See https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics
lda1_000.top_topics(bow1_000)

[([(0.011204761, 'good'),
   (0.011032292, 'get'),
   (0.008898975, 'food'),
   (0.008082554, 'come'),
   (0.0079306085, 'place'),
   (0.0078652715, 'time'),
   (0.007531253, 'go'),
   (0.007052512, 'service'),
   (0.006443688, 'one'),
   (0.005591475, 'order'),
   (0.00520837, 'try'),
   (0.0050734486, 'great'),
   (0.0049117235, 'like'),
   (0.0045495923, 'would'),
   (0.004541352, 'meal'),
   (0.004498734, 'back'),
   (0.0042737816, 'say'),
   (0.0042591286, 'restaurant'),
   (0.0041963435, 'love'),
   (0.003724203, 'well')],
  -1.3023760981077013),
 ([(0.012662879, 'food'),
   (0.011021428, 'place'),
   (0.009415503, 'great'),
   (0.0091834515, 'go'),
   (0.008783108, 'good'),
   (0.007936001, 'order'),
   (0.007903132, 'well'),
   (0.0076403674, 'time'),
   (0.0065482096, 'like'),
   (0.0065188073, 'make'),
   (0.005822111, 'get'),
   (0.0056278403, 'try'),
   (0.0055285655, 'restaurant'),
   (0.005366005, 'come'),
   (0.0052771415, 'one'),
   (0.005037206, 'back'),
   (0.00495804

In [104]:
%%time

vis_data1_000 = pyLDAvis.gensim.prepare(lda1_000, bow1_000, feature_dict1_000)

CPU times: user 5.68 s, sys: 400 ms, total: 6.08 s
Wall time: 1min 3s


In [109]:
pyLDAvis.display(vis_data1_000)

### LDA Model with Sample Size 10,000 ###

In [86]:
%%time

lda10_000 = LdaMulticore(corpus=bow10_000,
                        id2word=id2word10_000,
                        num_topics=5,
                        workers=5,
                        chunksize=2000,
                        passes=1,
                        batch=False,
                        alpha='symmetric',
                        eta=None,
                        decay=0.5,
                        offset=1.0,
                        eval_every=None,
                        iterations=50,
                        gamma_threshold=0.001,
                        minimum_probability=0.01,
                        random_state=7
#                         minimum_phi_value=0.01,
#                         per_word_topics=False,
#                         dtype=<class 'numpy.float32'>,
                    )

CPU times: user 1.16 s, sys: 316 ms, total: 1.47 s
Wall time: 1.75 s


In [87]:
lda10_000.show_topics(formatted=False)

[(0,
  [('food', 0.011527253),
   ('good', 0.010965304),
   ('place', 0.009626116),
   ('order', 0.008371312),
   ('like', 0.008136446),
   ('go', 0.0076647433),
   ('get', 0.0075358297),
   ('great', 0.007339271),
   ('one', 0.007181665),
   ('make', 0.0070636)]),
 (1,
  [('great', 0.0129324505),
   ('food', 0.012647134),
   ('place', 0.011795186),
   ('service', 0.010658088),
   ('come', 0.010007418),
   ('good', 0.008636398),
   ('like', 0.007939527),
   ('get', 0.007889032),
   ('go', 0.007082385),
   ('order', 0.005842592)]),
 (2,
  [('place', 0.01570077),
   ('get', 0.015632696),
   ('good', 0.015381798),
   ('food', 0.013805398),
   ('go', 0.012369021),
   ('time', 0.010384177),
   ('order', 0.009553988),
   ('like', 0.0070229047),
   ('great', 0.006190839),
   ('try', 0.005451274)]),
 (3,
  [('food', 0.015497324),
   ('good', 0.009655477),
   ('get', 0.009487278),
   ('go', 0.008981699),
   ('order', 0.008794666),
   ('place', 0.008135009),
   ('great', 0.007745277),
   ('one',

In [97]:
lda10_000.top_topics(bow10_000)

[([(0.015497324, 'food'),
   (0.009655477, 'good'),
   (0.009487278, 'get'),
   (0.008981699, 'go'),
   (0.008794666, 'order'),
   (0.008135009, 'place'),
   (0.007745277, 'great'),
   (0.0077106906, 'one'),
   (0.0075905817, 'service'),
   (0.0072685676, 'come'),
   (0.006808387, 'would'),
   (0.0062794867, 'love'),
   (0.0052507725, 'time'),
   (0.005134007, 'restaurant'),
   (0.005067099, 'like'),
   (0.005012744, 'try'),
   (0.0047140564, 'take'),
   (0.0045579616, 'back'),
   (0.0045004115, 'really'),
   (0.0044915625, 'say')],
  -1.2604571573723808),
 ([(0.011527253, 'food'),
   (0.010965304, 'good'),
   (0.009626116, 'place'),
   (0.008371312, 'order'),
   (0.008136446, 'like'),
   (0.0076647433, 'go'),
   (0.0075358297, 'get'),
   (0.007339271, 'great'),
   (0.007181665, 'one'),
   (0.0070636, 'make'),
   (0.0067745172, 'time'),
   (0.006141056, 'try'),
   (0.0060359407, 'well'),
   (0.0054661785, 'come'),
   (0.0050809113, 'restaurant'),
   (0.005049891, 'would'),
   (0.004944

In [107]:
%%time

vis_data10_000 = pyLDAvis.gensim.prepare(lda10_000, bow10_000, feature_dict10_000)

CPU times: user 20.2 s, sys: 1.3 s, total: 21.5 s
Wall time: 3min 45s


In [108]:
pyLDAvis.display(vis_data10_000)

### LDA Model with Sample Size 100,000 ###

In [89]:
%%time

lda100_000 = LdaMulticore(corpus=bow100_000,
                        id2word=id2word100_000,
                        num_topics=5,
                        workers=5,
                        chunksize=2000,
                        passes=1,
                        batch=False,
                        alpha='symmetric',
                        eta=None,
                        decay=0.5,
                        offset=1.0,
                        eval_every=None,
                        iterations=50,
                        gamma_threshold=0.001,
                        minimum_probability=0.01,
                        random_state=7
#                         minimum_phi_value=0.01,
#                         per_word_topics=False,
#                         dtype=<class 'numpy.float32'>,
                    )

CPU times: user 24.4 s, sys: 3.69 s, total: 28.1 s
Wall time: 17.8 s


In [90]:
lda100_000.show_topics(formatted=False)

[(0,
  [('food', 0.010301965),
   ('like', 0.009722904),
   ('good', 0.009251022),
   ('order', 0.00888321),
   ('place', 0.007920678),
   ('come', 0.007660559),
   ('restaurant', 0.0073490497),
   ('one', 0.007248193),
   ('try', 0.0070958445),
   ('chicken', 0.006795262)]),
 (1,
  [('go', 0.015959743),
   ('place', 0.013255054),
   ('order', 0.012505443),
   ('get', 0.011103365),
   ('time', 0.009243421),
   ('good', 0.008764825),
   ('food', 0.007326077),
   ('back', 0.0072726062),
   ('say', 0.0071426025),
   ('like', 0.007008652)]),
 (2,
  [('food', 0.022347925),
   ('service', 0.013905951),
   ('good', 0.011932319),
   ('time', 0.010544484),
   ('place', 0.009370003),
   ('get', 0.009047121),
   ('come', 0.008830168),
   ('like', 0.008543043),
   ('order', 0.007943083),
   ('go', 0.007864215)]),
 (3,
  [('good', 0.011593052),
   ('place', 0.01114119),
   ('food', 0.010100398),
   ('fry', 0.010067904),
   ('love', 0.009506537),
   ('delicious', 0.0076631643),
   ('burger', 0.00749

In [96]:
lda100_000.top_topics(bow100_000)

[([(0.015959743, 'go'),
   (0.013255054, 'place'),
   (0.012505443, 'order'),
   (0.011103365, 'get'),
   (0.009243421, 'time'),
   (0.008764825, 'good'),
   (0.007326077, 'food'),
   (0.0072726062, 'back'),
   (0.0071426025, 'say'),
   (0.007008652, 'like'),
   (0.0067377877, 'great'),
   (0.0066067227, 'one'),
   (0.006369754, 'take'),
   (0.0063180914, 'would'),
   (0.0062427013, 'come'),
   (0.005867725, 'try'),
   (0.0058167162, 'make'),
   (0.005640407, 'wait'),
   (0.0056004473, 'service'),
   (0.0053679347, 'love')],
  -1.2207290075104114),
 ([(0.010301965, 'food'),
   (0.009722904, 'like'),
   (0.009251022, 'good'),
   (0.00888321, 'order'),
   (0.007920678, 'place'),
   (0.007660559, 'come'),
   (0.0073490497, 'restaurant'),
   (0.007248193, 'one'),
   (0.0070958445, 'try'),
   (0.006795262, 'chicken'),
   (0.0067417813, 'make'),
   (0.0066879326, 'get'),
   (0.0057606213, 'well'),
   (0.0056054113, 'go'),
   (0.0055619352, 'dish'),
   (0.0055553825, 'sauce'),
   (0.005524473

### LDA Model with Sample Size 500,000 ###

In [93]:
%%time

lda500_000 = LdaMulticore(corpus=bow500_000,
                        id2word=id2word500_000,
                        num_topics=5,
                        workers=5,
                        chunksize=2000,
                        passes=1,
                        batch=False,
                        alpha='symmetric',
                        eta=None,
                        decay=0.5,
                        offset=1.0,
                        eval_every=None,
                        iterations=50,
                        gamma_threshold=0.001,
                        minimum_probability=0.01,
                        random_state=7
#                         minimum_phi_value=0.01,
#                         per_word_topics=False,
#                         dtype=<class 'numpy.float32'>,
                    )

CPU times: user 4min 28s, sys: 44.7 s, total: 5min 13s
Wall time: 1min 53s


In [98]:
lda500_000.show_topics(formatted=False)

[(0,
  [('good', 0.012664083),
   ('food', 0.011850028),
   ('place', 0.010963864),
   ('dish', 0.009526698),
   ('like', 0.008637138),
   ('restaurant', 0.008554481),
   ('chicken', 0.008153795),
   ('order', 0.008004137),
   ('get', 0.007243999),
   ('try', 0.006849711)]),
 (1,
  [('good', 0.01482968),
   ('get', 0.010676769),
   ('order', 0.009730199),
   ('burger', 0.009706478),
   ('fry', 0.009368991),
   ('like', 0.00907249),
   ('cheese', 0.008136843),
   ('go', 0.0077758585),
   ('come', 0.007360467),
   ('chicken', 0.0068593426)]),
 (2,
  [('get', 0.014071949),
   ('go', 0.012654565),
   ('pizza', 0.010407035),
   ('place', 0.009489011),
   ('order', 0.0092026265),
   ('say', 0.008244008),
   ('one', 0.00818839),
   ('like', 0.007869867),
   ('time', 0.0070830677),
   ('food', 0.0067708977)]),
 (3,
  [('food', 0.030980019),
   ('service', 0.020788107),
   ('time', 0.017565804),
   ('great', 0.014718281),
   ('come', 0.014308248),
   ('order', 0.014292425),
   ('good', 0.013576

In [99]:
lda500_000.top_topics(bow500_000)

[([(0.014071949, 'get'),
   (0.012654565, 'go'),
   (0.010407035, 'pizza'),
   (0.009489011, 'place'),
   (0.0092026265, 'order'),
   (0.008244008, 'say'),
   (0.00818839, 'one'),
   (0.007869867, 'like'),
   (0.0070830677, 'time'),
   (0.0067708977, 'food'),
   (0.00642899, 'would'),
   (0.0060832435, 'take'),
   (0.005997097, 'back'),
   (0.0059471675, 'come'),
   (0.0059353495, 'make'),
   (0.005674911, 'dont'),
   (0.005482478, 'ask'),
   (0.005337891, 'us'),
   (0.004909312, 'good'),
   (0.0048892917, 'even')],
  -1.2450687883807496),
 ([(0.01482968, 'good'),
   (0.010676769, 'get'),
   (0.009730199, 'order'),
   (0.009706478, 'burger'),
   (0.009368991, 'fry'),
   (0.00907249, 'like'),
   (0.008136843, 'cheese'),
   (0.0077758585, 'go'),
   (0.007360467, 'come'),
   (0.0068593426, 'chicken'),
   (0.006534986, 'place'),
   (0.0063788146, 'food'),
   (0.006268972, 'try'),
   (0.006090687, 'sauce'),
   (0.0058407905, 'taste'),
   (0.0056140553, 'one'),
   (0.005470538, 'salad'),
   

## 4. Conclusion ##

Now that we have built baseline LDA models at several sample sizes, we can move on to the extended modeling stage of the project. There, we will refine one or more LDA models by first settling on a method for evaluating them and then improving the models against that standard.
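
One candidate evaluation method is topic coherence, which is the score that `top_topics` reports next to each topic above (a UMass-style measure by default, to our knowledge). The sketch below is a simplified standard-library version of that idea, not gensim's implementation: `umass_coherence` is a hypothetical helper that averages log((D(w_m, w_l) + eps) / D(w_l)) over word pairs, where D counts the documents containing the given words.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs, eps=1.0):
    """Mean of log((D(w_m, w_l) + eps) / D(w_l)) over ordered word pairs.

    top_words must hold at least two words, ordered from most to least
    probable, and every word must occur in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = [
        math.log((doc_freq(w_m, w_l) + eps) / doc_freq(w_l))
        for w_l, w_m in combinations(top_words, 2)  # w_l ranks above w_m
    ]
    return sum(scores) / len(scores)


docs = [["food", "good"], ["food", "service"], ["food", "good", "service"], ["food"]]
score = umass_coherence(["food", "good"], docs)
```

Scores closer to zero indicate that a topic's top words tend to appear in the same documents, which gives a rough way to compare models trained with different settings.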

**Ideas for Next Steps**
* Develop a method for evaluating the LDA model results in line with the overarching project goals.
* Use bigrams (or ngrams for some n) as features instead of individual tokens.
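* Use the `filter_extremes` method to remove infrequent and extremely frequent tokens (see [here](https://radimrehurek.com/gensim/corpora/dictionary.html)). Generic words such as `food`, `good`, and `place` rank near the top of almost every topic above, which makes the topics hard to tell apart.
* Try different variations of the LDA model parameters (see below).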
* Construct visualizations of topic models using pyLDAvis, and explore the meaning of these visualizations.
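
The frequency-filtering idea can be sketched without gensim. The helper below, `filter_by_doc_freq`, is hypothetical and only mirrors the `no_below` / `no_above` semantics of `Dictionary.filter_extremes`: keep tokens that appear in at least `no_below` documents and in no more than a `no_above` fraction of all documents. The thresholds used here are chosen for the toy data, not recommended values.

```python
from collections import Counter


def filter_by_doc_freq(tokenized_docs, no_below=2, no_above=0.5):
    """Drop tokens found in fewer than no_below documents or in more
    than a no_above fraction of all documents."""
    n_docs = len(tokenized_docs)
    # Count each token once per document (document frequency)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    keep = {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]


docs = [
    ["food", "pizza", "good"],
    ["food", "burger", "fry"],
    ["food", "pizza", "crust"],
    ["food", "burger", "good"],
]
filtered = filter_by_doc_freq(docs)
```

In this toy corpus `food` is dropped because it occurs in every document, and `fry` and `crust` are dropped because each occurs only once, leaving the tokens that actually distinguish one document from another.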

**Possible LDA Parameter Variations to Try**
* *Sample Size*: [1_000, 10_000, 100_000]
* *num_topics*: [5, 10, 25, 50, 100], default=100
* *passes*: default=1
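* *alpha*: ['symmetric', 'auto'], default='symmetric' (as far as we know, `'auto'` is only implemented in the single-core `LdaModel`, so trying it would mean switching away from `LdaMulticore`)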
* *eta*: [None, 'auto'], default=None
* *decay*: default=0.5
* *offset*: default=1.0
* *eval_every*: [None, 10], default=10
* *iterations*: default=50
* *gamma_threshold*: default=0.001
* *minimum_probability*: [0.01, 0.1, 0.25, 0.5, 0.75, 0.9], default=0.01
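
One simple way to organize these variations is to expand a small grid into keyword-argument dictionaries, one per model run. The sketch below uses only the standard library; `param_grid` and `iter_param_sets` are hypothetical names, and the grid covers just three of the parameters listed above to keep the number of runs small.

```python
from itertools import product

# Candidate values taken from the list above (subset for illustration)
param_grid = {
    "num_topics": [5, 10, 25, 50, 100],
    "eta": [None, "auto"],
    "eval_every": [None, 10],
}


def iter_param_sets(grid):
    """Yield one keyword-argument dict per combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


param_sets = list(iter_param_sets(param_grid))
```

Each dictionary could then be unpacked into the model constructor alongside the corpus and id-to-word mapping. Because the grid size is the product of the list lengths, adding parameters multiplies the number of models to train, so it is worth starting with the smaller sample sizes.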

**Sources Used for Guidance**
* https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#sphx-glr-auto-examples-tutorials-run-lda-py
* https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
* https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28
* https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
* https://simonhessner.de/lemmatize-whole-sentences-with-python-and-nltks-wordnetlemmatizer/
* https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
* https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
* https://kite.com/python/docs/nltk.pos_tag
* https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
* https://datascience.blog.wzb.eu/2016/06/17/creating-a-sparse-document-term-matrix-for-topic-modeling-via-lda/