## Testing LDA on a smaller dataset:

In [2]:
movies_df = pd.read_csv('sentiment/movies.csv')
movies_doc = movies_df.text.to_list()

In [3]:
import re
from pprint import pprint

import pandas as pd
from sqlalchemy import create_engine
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.models import LdaModel
from gensim.matutils import corpus2csc, Sparse2Corpus
from gensim.models.ldamulticore import LdaMulticore
from gensim.parsing.preprocessing import (
    preprocess_string, split_alphanum, strip_tags, strip_punctuation,
    strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text,
)

# Matches any character that is neither a word character nor whitespace.
not_alphanumeric_or_space = re.compile(r'[^\w\s]')

# Filters run in order: lowercase, strip HTML tags, punctuation, extra whitespace,
# digits, stopwords and short tokens, then stem.
CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_multiple_whitespaces,
                  strip_numeric, remove_stopwords, strip_short, stem_text]

def preprocess(doc):
    """Clean one raw document and return it as a space-separated string of stemmed tokens."""
    words = preprocess_string(doc, filters=CUSTOM_FILTERS)
    doc = ' '.join(words).lower()
    return re.sub(not_alphanumeric_or_space, '', doc)

def docs_prep(docs):
    """Preprocess every document and split it into the token list gensim expects."""
    return [preprocess(doc).split() for doc in docs]

def sql_to_list_converter(chunks):
    preprocessed_texts = []
    for i in chunks:
        
        # gensim expects each document as a list of tokens, so split every text on whitespace
        list_to_add = [text.split() for text in i.preprocessed_text.tolist()]
        preprocessed_texts += list_to_add
        
    return preprocessed_texts

def tf_idf_vect(preprocessed_texts):
    dictionary = Dictionary(preprocessed_texts)
    
    # remove words that appear in fewer than 20 documents or in more than 50% of the documents
    dictionary.filter_extremes(no_below=20, no_above=0.5)

    num_docs = dictionary.num_docs
    num_terms = len(dictionary.keys())
    
    #transform into bow
    corpus_bow = [dictionary.doc2bow(doc) for doc in preprocessed_texts]
    
    #transform into tf-idf:
    tfidf = TfidfModel(corpus_bow) #,normalize = True)
    corpus_tfidf = tfidf[corpus_bow]
    
    #transform into sparse matrix:
    corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms,num_docs=num_docs)
    
                       
    # NB: from what I read online, LDA can work just as well with corpus_tfidf,
    # and even better in some cases, so the tfidf corpus can be used instead of the bow one
    # NB: the dictionary has to be returned too, since the LDA model needs it
    
    return dictionary, corpus_tfidf, corpus_bow #, corpus_tfidf_sparse
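
A note on the character class used in `preprocess`: inside `[...]`, the characters `(`, `|` and `)` are literals, not grouping or alternation, so a class written as `[^(\w|\s|\d)]` also keeps those three characters. In this pipeline `strip_punctuation` runs first and should already have removed them, so the result is probably the same either way. The sketch below only illustrates the difference on a made-up sample string.

```python
import re

# Class written with grouping syntax: parentheses and | are treated as literal characters.
grouped_class = re.compile(r'[^(\w|\s|\d)]')
# Plain form of the intended class: keep word characters and whitespace only.
plain_class = re.compile(r'[^\w\s]')

sample = 'a(b)|c!'  # made-up test string
kept_by_grouped = grouped_class.sub('', sample)
kept_by_plain = plain_class.sub('', sample)

print(kept_by_grouped)
print(kept_by_plain)
```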

In [4]:
doccc = docs_prep(movies_doc)

In [45]:
%%time

dic, corpus_tfidf, corpus_bow = tf_idf_vect(doccc)

CPU times: user 17 s, sys: 196 ms, total: 17.2 s
Wall time: 17.2 s
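
To make the difference between the two returned corpora concrete, here is a minimal pure-Python sketch on a made-up toy corpus. It assumes the weighting count * log2(N / df) followed by L2 normalisation, which is how I understand the `TfidfModel` defaults; treat that as an assumption rather than a description of gensim internals. The names `bow` and `tfidf` are illustrative helpers, not gensim functions.

```python
import math
from collections import Counter

# Toy tokenised corpus (hypothetical, not the movie reviews).
docs = [
    ['movi', 'good', 'movi'],
    ['movi', 'bad'],
    ['good', 'act'],
]

# Document frequency: the number of documents each token appears in.
doc_freq = Counter(token for doc in docs for token in set(doc))
num_docs = len(docs)

def bow(doc):
    """Raw term counts, the kind of information a bag-of-words vector stores."""
    return dict(Counter(doc))

def tfidf(doc):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weights = {t: c * math.log2(num_docs / doc_freq[t]) for t, c in bow(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in weights.items() if w > 0}

print(bow(docs[0]))
print(tfidf(docs[0]))
```

The BOW vector holds integer counts, while the TF-IDF vector holds fractional weights, which is the form seen in `corpus_tfidf[0]`.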


### Using corpus tfidf:

In [48]:
corpus_tfidf[0]

[(0, 0.06438062249818663),
 (1, 0.05331512447771581),
 (2, 0.129758482602989),
 (3, 0.0465760545473316),
 (4, 0.09779651860947264),
 (5, 0.0768311020043291),
 (6, 0.044647079364953546),
 (7, 0.14869087541838666),
 (8, 0.0594149220504515),
 (9, 0.058693449766686084),
 (10, 0.0752940831920687),
 (11, 0.09269158016746293),
 (12, 0.12637604504456942),
 (13, 0.05548940236173458),
 (14, 0.048119773530762536),
 (15, 0.11563755034289788),
 (16, 0.15284337676394705),
 (17, 0.05204048052623202),
 (18, 0.08755548468874456),
 (19, 0.09358987959638854),
 (20, 0.07299980971594074),
 (21, 0.059542978082551205),
 (22, 0.09844101530493939),
 (23, 0.07107804724178875),
 (24, 0.05376809713533608),
 (25, 0.06124211612359264),
 (26, 0.03723792415725086),
 (27, 0.04855951421759549),
 (28, 0.0757269293184007),
 (29, 0.04535985563850467),
 (30, 0.06956530146846655),
 (31, 0.05710618834281626),
 (32, 0.0999121663575114),
 (33, 0.04148685322345651),
 (34, 0.04822475480586997),
 (35, 0.037850811440503555),
 (36,

In [47]:
%%time

# Set training parameters.
num_topics = 50
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dic[0]  # This is only to "load" the dictionary.
id2word = dic.id2token

model = LdaMulticore(
    corpus=corpus_tfidf,
    id2word=id2word,
    chunksize=chunksize,
    workers=None,
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

CPU times: user 6min 33s, sys: 4.2 s, total: 6min 37s
Wall time: 6min 36s


#### Getting the topics and their probabilities for one document:

In [61]:
x = model.get_document_topics(corpus_tfidf[0], minimum_probability=None)
x

[(8, 0.83751667), (16, 0.06718511)]

#### Getting the words describing the topics for the same document:

In [62]:
for i in x:
    print(model.show_topic(i[0], topn=20))

[('charact', 0.0022977034), ('stori', 0.0022455796), ('like', 0.0021126294), ('time', 0.0019672627), ('good', 0.0019669973), ('love', 0.001953449), ('scene', 0.0019352934), ('great', 0.0019019066), ('watch', 0.0018573694), ('plai', 0.0017727332), ('peopl', 0.0017476061), ('look', 0.0017047834), ('end', 0.0016921982), ('wai', 0.0016395828), ('think', 0.0016117092), ('life', 0.0016099963), ('work', 0.0015607986), ('actor', 0.001550181), ('act', 0.0015447924), ('year', 0.0015330676)]
[('jacki', 0.007437012), ('chan', 0.0073808166), ('kung', 0.005362623), ('stan', 0.0052415095), ('laurel', 0.005238125), ('kong', 0.004725162), ('hong', 0.004513635), ('hardi', 0.004380598), ('martial', 0.0039120535), ('wire', 0.0031765273), ('grier', 0.0030849183), ('olli', 0.002839588), ('homer', 0.0028322577), ('pam', 0.0028288807), ('monologu', 0.0027881463), ('greed', 0.0027833104), ('ebert', 0.0027241746), ('sinatra', 0.002718796), ('lamb', 0.0027081873), ('quinn', 0.0027063782)]


Note: with the tfidf corpus, the dominant topic is almost always topic 8, whose top words ('charact', 'stori', 'scene', 'watch', ...) describe movies in general. That makes sense, since all the documents are movie reviews.
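
One way to check how often a single topic dominates is to take the most probable topic per document and count the results. The sketch below does this on a small list in the same shape as the `get_document_topics` output. Only the first entry is the real output for document 0; the other two are hypothetical, and `dominant_topic` is an illustrative helper.

```python
from collections import Counter

def dominant_topic(doc_topics):
    """Return the topic id with the highest probability, or None for an empty list."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])[0]

# First entry: output shown above for document 0. The rest are made up for illustration.
example_outputs = [
    [(8, 0.83751667), (16, 0.06718511)],
    [(8, 0.91), (3, 0.05)],
    [(21, 0.60), (8, 0.35)],
]

counts = Counter(dominant_topic(topics) for topics in example_outputs)
print(counts.most_common())
```

With the trained model, the same count could be run over `model.get_document_topics(doc)` for every `doc` in the corpus.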

### Using corpus bow:

In [63]:
%%time

# Set training parameters.
num_topics = 50
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dic[0]  # This is only to "load" the dictionary.
id2word = dic.id2token

model = LdaMulticore(
    corpus=corpus_bow,
    id2word=id2word,
    chunksize=chunksize,
    workers=None,
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

CPU times: user 3min 55s, sys: 12.8 s, total: 4min 8s
Wall time: 4min 19s


Training on the corpus bow is faster than training on the tfidf corpus: about 4 min 19 s of wall time compared to 6 min 36 s.

#### Getting the topics with the topn words that describe them:

In [64]:
%%time

top_topics = model.top_topics(corpus_bow, topn=5)
#pprint(top_topics)

CPU times: user 753 ms, sys: 22 ms, total: 775 ms
Wall time: 774 ms


This returns all 50 topics, ordered by coherence score, each with the 5 words that have the highest probability in that topic.
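
As far as I understand the gensim API, the result of `top_topics` is a list of `(topic, coherence)` pairs, where `topic` is itself a list of `(probability, word)` pairs. The sketch below unpacks a result of that shape. The values are made up for illustration and `summarise` is a hypothetical helper, not part of gensim.

```python
# Made-up values in the shape described above:
# a list of (topic_words, coherence), where topic_words is a list of (probability, word).
example_top_topics = [
    ([(0.024, 'horror'), (0.021, 'bad'), (0.016, 'like')], -1.2),
    ([(0.013, 'school'), (0.008, 'boi'), (0.008, 'girl')], -1.8),
]

def summarise(top_topics):
    """Return (average coherence, list of word lists) for a top_topics-style result."""
    avg = sum(coherence for _, coherence in top_topics) / len(top_topics)
    words = [[word for _, word in topic] for topic, _ in top_topics]
    return avg, words

avg_coherence, topic_words = summarise(example_top_topics)
print(round(avg_coherence, 2))
print(topic_words)
```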

#### Getting the topics and their probabilities for one document:

In [65]:
x = model.get_document_topics(corpus_bow[0], minimum_probability=0.1)
x

[(7, 0.13438915), (22, 0.12401116), (34, 0.2252998)]

This shows the topics of document 0 (represented in terms of BOW as corpus_bow[0]), with the probability of each topic. Only topics with a probability of at least 0.1 are returned. In this case, the document corresponds to topic 34 with 0.225 probability, topic 7 with 0.134 probability, and topic 22 with 0.124 probability.
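
The returned pairs come back ordered by topic id, not by probability. A short sketch of ranking them, using the output above copied in as a literal so it runs without the model:

```python
# Output of get_document_topics for document 0, copied from the cell above.
doc0_topics = [(7, 0.13438915), (22, 0.12401116), (34, 0.2252998)]

# Sort by probability, highest first.
ranked = sorted(doc0_topics, key=lambda pair: pair[1], reverse=True)
for topic_id, prob in ranked:
    print(f'topic {topic_id}: {prob:.3f}')

# Probability mass not accounted for by the topics returned above the 0.1 threshold.
remaining = 1.0 - sum(prob for _, prob in doc0_topics)
print(f'below threshold: {remaining:.3f}')
```

Assuming the topic probabilities of a document sum to roughly 1, the remaining mass is spread over topics that each fall below the threshold.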

#### Getting the words describing the topics for the same document:

In [66]:
for i in x:
    print(model.show_topic(i[0], topn=10))

[('school', 0.013332094), ('boi', 0.008467523), ('girl', 0.008024706), ('like', 0.006921068), ('children', 0.0065888925), ('student', 0.00607882), ('high', 0.0058694812), ('kid', 0.0055167074), ('parent', 0.0051464), ('child', 0.004577135)]
[('like', 0.013654727), ('holm', 0.0132337585), ('sex', 0.013080628), ('porn', 0.010436354), ('look', 0.008345899), ('scene', 0.007331529), ('shirlei', 0.0050096326), ('bone', 0.004902359), ('ellen', 0.004652755), ('jennif', 0.004401875)]
[('horror', 0.023603853), ('bad', 0.02096437), ('like', 0.016407525), ('look', 0.01231438), ('good', 0.012066251), ('effect', 0.011218209), ('act', 0.010566096), ('watch', 0.009061358), ('scene', 0.0086893225), ('budget', 0.007922259)]


This shows, for each topic of document 0, the top 10 words relating to that topic. Recall we had three topics for document 0, so the lists appear in the same order as the output above: the first list is the top 10 words for topic 7, the second for topic 22, and the third for topic 34.

## TFIDF VS CORPUS BOW:

In [67]:
movies_doc[0]

'\'P\' (or Club-P) should really be called \'L\' for lame. Every festival has a disappointment and this is the one that fails to live up to its much-hyped logline: "Thai lesbians fighting monsters." Rather, this is the tale of a Khmer country girl who\'s grandmother has taught her a little witchcraft along with a few odd (but specific) rules: "don\'t walk under a clothesline," "don\'t eat raw meat," and "don\'t accept money for your powers." Well, guess what folks, the girl moves to Bangkok to raise some money as a \'bar-girl\' and manages to break all the rules granny taught her which subsequently releases an evil spirit that conveniently kills the \'foreign johns\' who pay for her services.<br /><br />While this film can\'t even be released in Thailand due to it\'s controversial subject matter most American audiences will find this ho-hum horror pic a cross between "Showgirls" and "Interview with the Vampire" as directed by Walt Disney.<br /><br />If not for a few scenes with signifi

Using the tfidf instead of the normal bow makes a difference. Training on the tfidf corpus took 6 min 36 s of wall time, compared to 4 min 19 s for the corpus bow.

When checking the first document, the normal bow makes more sense in terms of the topics assigned to each document. The tfidf model identifies most documents as having the same generic topic: movies and series. This is true, but we are more interested in the individual topics. The dominant tfidf topic is made up of words that appear in most reviews ('charact', 'stori', 'scene', ...), so it does not help us identify the sub-topic of each document, for example porn in movies vs. children and high school in movies. Using the normal bow is better at this.