# Topic Modeling Using LDA

Description: Topic Modeling is a type of statistical modeling for discovering the abstract topics that occur in a collection of documents.

Latent Dirichlet Allocation (LDA) is a topic model that can be used to assign the text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
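
As a rough illustration of these two models, the sketch below samples a toy document the way LDA assumes documents are generated. The vocabulary, the prior values, and the helper names `sample_dirichlet` and `generate_document` are assumptions made for this example; none of them come from the notebook or from gensim.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(vocab, topic_word, alpha, doc_len, rng):
    """Generate one toy document following the LDA generative story."""
    num_topics = len(topic_word)
    # Topics-per-document model: a distribution over topics for this document
    doc_topic = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(doc_len):
        # Pick a topic for this word position, then a word from that topic
        topic = rng.choices(range(num_topics), weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return doc_topic, words


rng = random.Random(0)
toy_vocab = ["card", "cash", "save", "win", "app"]  # illustrative vocabulary

# Words-per-topic model: one distribution over the vocabulary per topic
topic_word = [sample_dirichlet([0.5] * len(toy_vocab), rng) for _ in range(3)]

doc_topic, words = generate_document(toy_vocab, topic_word, alpha=0.5, doc_len=8, rng=rng)
print(doc_topic)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that most plausibly produced them.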

## Import Libraries

### Main Libraries

In [1]:
import pandas as pd

### NLP Libraries

In [2]:
import gensim
from gensim import corpora, models
from pprint import pprint
from nltk.tokenize import word_tokenize

## Load Dataset

In [3]:
# Load dataset
df = pd.read_csv('C:/Users/cherryb/Desktop/Personal Projects/Datasets/Telus - Fintech/cleaned/fullstatCleaned_withLabels.tsv', index_col='Unnamed: 0', sep='\t')
# Inspect df
df.head()

| Unnamed: 0 | by | category | comment_likes_count | comments_base | comments_count_fb | comments_replies | comments_retrieved | engagement_fb | likes_count_fb | post_id | ... | post_published | rea_ANGRY | rea_HAHA | rea_LOVE | rea_SAD | rea_THANKFUL | rea_WOW | reactions_count_fb | shares_count_fb | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | post_page_155027942462 | App Update | 0 | 0 | 11 | 0 | 0 | 22 | 11 | 155027942462_10157348020627463 | ... | 2019-06-27T14:30:13+0000 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | photo |
| 1 | post_page_155027942462 | Engagement | 0 | 0 | 30 | 0 | 0 | 110 | 70 | 155027942462_10157333387457463 | ... | 2019-06-21T14:31:06+0000 | 1 | 0 | 4 | 0 | 0 | 1 | 76 | 4 | photo |
| 2 | post_page_155027942462 | Engagement | 0 | 0 | 11 | 0 | 0 | 17 | 5 | 155027942462_10157330985232463 | ... | 2019-06-20T15:00:28+0000 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | photo |
| 3 | post_page_155027942462 | Engagement | 0 | 0 | 10 | 0 | 0 | 19 | 8 | 155027942462_10157323881577463 | ... | 2019-06-17T14:30:11+0000 | 1 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | photo |
| 4 | post_page_155027942462 | Engagement | 0 | 0 | 8 | 0 | 0 | 17 | 6 | 155027942462_10157315990422463 | ... | 2019-06-14T14:30:02+0000 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | video |


In [4]:
# Tokenize post_message column
df['post_message'] = df['post_message'].apply(word_tokenize)

## Data Pre-processing

In [5]:
# Select df['post_message']
processed_words = df['post_message']

In [6]:
# Create a list of processed_words
processed_words = [words for words in processed_words]

### Bag of Words (BoW)

In [7]:
# Create a gensim Dictionary from processed_words: it maps each unique token
# to an integer id and records how many documents each token appears in
dictionary = corpora.Dictionary(processed_words)

In [8]:
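# Drop tokens that appear in fewer than 15 documents or in more than 50% of documents,
# then keep at most the 100,000 most frequent remaining tokens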
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
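
To make the three filtering rules concrete, here is a plain-Python sketch that approximates them on a toy corpus. The function name `filter_vocabulary` and the toy documents are hypothetical, and the sketch is only meant to mirror the documented behaviour of `filter_extremes`, not to reproduce gensim's implementation.

```python
from collections import Counter


def filter_vocabulary(docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: the number of documents each token appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # no_above is a fraction of the corpus; convert it to a document count
    max_docs = int(no_above * len(docs))
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Of the remaining tokens, keep only the keep_n most frequent
    kept.sort(key=lambda t: -doc_freq[t])
    return set(kept[:keep_n])


toy_docs = [
    ["green", "dot", "card"],
    ["green", "dot", "cash"],
    ["green", "save"],
    ["green", "dot", "win"],
]
# "green" is in all 4 documents (too common), "dot" is in 3, the rest in 1 (too rare)
print(filter_vocabulary(toy_docs, no_below=2, no_above=0.75))  # {'dot'}
```

With the notebook's settings, a token must appear in at least 15 posts and in no more than half of all posts to stay in the dictionary.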

In [9]:
# Convert each document into a bag of words: a list of (token_id, count) pairs
corpus_bow = [dictionary.doc2bow(doc) for doc in processed_words]

In [10]:
# Preview Bag of Words for a sample preprocessed document
bow_doc_500 = corpus_bow[500]

for token_id, count in bow_doc_500:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 57 ("paydai") appears 1 time.
Word 68 ("see") appears 1 time.
Word 108 ("week") appears 1 time.
Word 118 ("someon") appears 1 time.
Word 119 ("bonu") appears 1 time.
Word 163 ("winner") appears 1 time.
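
The conversion performed by `doc2bow` can be sketched in a few lines of plain Python. The helper name `to_bow` is hypothetical; the token ids are copied from the output above, and the token list is the one the notebook prints for document 500 in the classification section.

```python
from collections import Counter


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Token ids as printed for document 500
token2id = {"paydai": 57, "see": 68, "week": 108, "someon": 118, "bonu": 119, "winner": 163}
doc_500 = ['someon', 'cloud', 'nine', 'see', 'week', 'paydai', 'bonu', 'winner']

print(to_bow(doc_500, token2id))
# [(57, 1), (68, 1), (108, 1), (118, 1), (119, 1), (163, 1)]
```

The tokens `cloud` and `nine` do not appear in the result because they are not in the filtered dictionary, so only six of the eight tokens contribute to the bag of words.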


### TFIDF

In [11]:
# Fit model
tfidf = models.TfidfModel(corpus_bow)

In [12]:
# Apply transformation to the entire corpus
corpus_tfidf = tfidf[corpus_bow]

In [13]:
# Preview TF-IDF scores for the first document
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.8057762139865111),
 (1, 0.3489304433755792),
 (2, 0.39392744378681527),
 (3, 0.2716494205605292)]
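
For intuition about where these weights come from, the sketch below computes TF-IDF for a single document in plain Python. It assumes the weighting that gensim's `TfidfModel` uses by default as far as I know (raw term count times `log2(N / df)`, followed by scaling the vector to unit length); the function name `tfidf_vector` and the toy numbers are made up for the example.

```python
import math


def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalize."""
    weighted = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []  # every token occurs in every document, so nothing is informative
    return [(tid, w / norm) for tid, w in weighted]


# Toy corpus of 4 documents: token 0 occurs in 1 document, token 1 in 2 documents
toy_weights = tfidf_vector([(0, 1), (1, 2)], doc_freq={0: 1, 1: 2}, num_docs=4)
print(toy_weights)

# The four weights printed above for the first document have unit length as well
first_doc = [0.8057762139865111, 0.3489304433755792, 0.39392744378681527, 0.2716494205605292]
print(round(sum(w * w for w in first_doc), 4))  # 1.0
```

Because each vector is normalized, the weights are comparable across posts of different lengths: a rare token in a short post and a rare token in a long post end up on the same scale.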


## Topic Modeling - LDA

### Using BoW

In [14]:
# Train the lda model using gensim.models.LdaMulticore and corpus_bow
lda_model = gensim.models.LdaMulticore(corpus_bow, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [15]:
# For each topic, explore the words occurring in that topic and their relative weights.
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}\n'.format(idx, topic))

Topic: 0 
Words: 0.059*"monei" + 0.040*"save" + 0.034*"green" + 0.032*"dot" + 0.030*"deposit" + 0.029*"learn" + 0.028*"get" + 0.028*"direct" + 0.026*"win" + 0.022*"time"

Topic: 1 
Words: 0.083*"dot" + 0.083*"green" + 0.040*"like" + 0.030*"page" + 0.028*"app" + 0.027*"win" + 0.027*"mobil" + 0.026*"sweepstak" + 0.024*"see" + 0.024*"u"

Topic: 2 
Words: 0.081*"card" + 0.072*"cash" + 0.066*"back" + 0.046*"green" + 0.045*"dot" + 0.041*"visa" + 0.040*"debit" + 0.037*"fee" + 0.034*"appli" + 0.029*"annual"

Topic: 3 
Words: 0.047*"card" + 0.039*"green" + 0.038*"dot" + 0.027*"like" + 0.027*"us" + 0.026*"fee" + 0.022*"wai" + 0.021*"appli" + 0.020*"back" + 0.020*"help"

Topic: 4 
Words: 0.064*"see" + 0.062*"credit" + 0.060*"paydai" + 0.059*"bonu" + 0.056*"week" + 0.046*"winner" + 0.042*"card" + 0.035*"secur" + 0.035*"schedul" + 0.034*"dot"

Topic: 5 
Words: 0.083*"get" + 0.066*"pai" + 0.037*"deposit" + 0.036*"period" + 0.024*"mai" + 0.023*"dai" + 0.023*"card" + 0.023*"time" + 0.020*"direct" + 0.

### Using TFIDF

In [16]:
# Train the lda model using gensim.models.LdaMulticore and corpus_tfidf
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [17]:
# For each topic, explore the words occurring in that topic and their relative weights.
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} \nWord: {}\n'.format(idx, topic))

Topic: 0 
Word: 0.052*"wai" + 0.050*"save" + 0.029*"love" + 0.029*"monei" + 0.024*"tip" + 0.023*"plan" + 0.023*"on" + 0.022*"summer" + 0.019*"smart" + 0.018*"stai"

Topic: 1 
Word: 0.034*"person" + 0.032*"someon" + 0.031*"ontheblog" + 0.030*"mobil" + 0.029*"app" + 0.028*"monei" + 0.023*"see" + 0.019*"balanc" + 0.018*"dai" + 0.017*"mondaymotiv"

Topic: 2 
Word: 0.042*"read" + 0.033*"schedul" + 0.030*"ssi" + 0.029*"pai" + 0.029*"season" + 0.028*"payment" + 0.026*"get" + 0.023*"bill" + 0.022*"favorit" + 0.019*"secur"

Topic: 3 
Word: 0.047*"cash" + 0.044*"back" + 0.039*"card" + 0.031*"debit" + 0.030*"fee" + 0.030*"visa" + 0.027*"appli" + 0.024*"bank" + 0.024*"annual" + 0.024*"dot"

Topic: 4 
Word: 0.033*"budget" + 0.028*"good" + 0.025*"dai" + 0.025*"thank" + 0.023*"list" + 0.022*"paid" + 0.022*"read" + 0.021*"winner" + 0.020*"see" + 0.019*"get"

Topic: 5 
Word: 0.035*"famili" + 0.026*"back" + 0.025*"make" + 0.024*"card" + 0.024*"cash" + 0.022*"fun" + 0.022*"credit" + 0.021*"start" + 0.019

## Classification of the Topics

### Performance evaluation by classifying a sample document using the LDA Bag of Words model

In [18]:
processed_words[500]

['someon', 'cloud', 'nine', 'see', 'week', 'paydai', 'bonu', 'winner']

In [19]:
for index, score in sorted(lda_model[corpus_bow[500]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.871422529220581	 
Topic: 0.064*"see" + 0.062*"credit" + 0.060*"paydai" + 0.059*"bonu" + 0.056*"week" + 0.046*"winner" + 0.042*"card" + 0.035*"secur" + 0.035*"schedul" + 0.034*"dot"

Score: 0.014287102967500687	 
Topic: 0.050*"u" + 0.049*"comment" + 0.041*"share" + 0.038*"tell" + 0.033*"financ" + 0.031*"love" + 0.030*"tip" + 0.025*"person" + 0.022*"summer" + 0.019*"entrant"

Score: 0.014286868274211884	 
Topic: 0.044*"back" + 0.040*"new" + 0.039*"tell" + 0.036*"u" + 0.036*"financi" + 0.034*"comment" + 0.033*"dai" + 0.032*"get" + 0.030*"cash" + 0.023*"budget"

Score: 0.014286590740084648	 
Topic: 0.064*"read" + 0.052*"financi" + 0.040*"u" + 0.037*"famili" + 0.029*"month" + 0.029*"learn" + 0.029*"good" + 0.029*"know" + 0.027*"mondaymotiv" + 0.027*"see"

Score: 0.014286577701568604	 
Topic: 0.083*"dot" + 0.083*"green" + 0.040*"like" + 0.030*"page" + 0.028*"app" + 0.027*"win" + 0.027*"mobil" + 0.026*"sweepstak" + 0.024*"see" + 0.024*"u"

Score: 0.014286219142377377	 
Topic: 0.049*

### Performance evaluation by classifying a sample document using the LDA TF-IDF model

In [20]:
for index, score in sorted(lda_model_tfidf[corpus_bow[500]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.8714191317558289	 
Topic: 0.046*"week" + 0.042*"financi" + 0.040*"bonu" + 0.040*"winner" + 0.038*"paydai" + 0.033*"see" + 0.030*"check" + 0.027*"learn" + 0.026*"free" + 0.020*"win"

Score: 0.014289558865129948	 
Topic: 0.034*"person" + 0.032*"someon" + 0.031*"ontheblog" + 0.030*"mobil" + 0.029*"app" + 0.028*"monei" + 0.023*"see" + 0.019*"balanc" + 0.018*"dai" + 0.017*"mondaymotiv"

Score: 0.014286858029663563	 
Topic: 0.051*"dot" + 0.050*"green" + 0.029*"share" + 0.027*"comment" + 0.026*"entrant" + 0.022*"budget" + 0.020*"credit" + 0.019*"extra" + 0.019*"happi" + 0.019*"u"

Score: 0.014286745339632034	 
Topic: 0.033*"budget" + 0.028*"good" + 0.025*"dai" + 0.025*"thank" + 0.023*"list" + 0.022*"paid" + 0.022*"read" + 0.021*"winner" + 0.020*"see" + 0.019*"get"

Score: 0.014286553487181664	 
Topic: 0.047*"cash" + 0.044*"back" + 0.039*"card" + 0.031*"debit" + 0.030*"fee" + 0.030*"visa" + 0.027*"appli" + 0.024*"bank" + 0.024*"annual" + 0.024*"dot"

Score: 0.014286398887634277	 
Top
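
The pattern in both outputs, one score near 0.8714 and the rest near 0.0143, can be reproduced with a back-of-the-envelope calculation. The sketch assumes a symmetric document-topic prior of `1 / num_topics` (which I believe is gensim's default when `alpha` is not set) and assumes that all six in-vocabulary tokens of document 500 are attributed to a single topic. The function name `expected_topic_scores` is hypothetical.

```python
def expected_topic_scores(num_tokens, num_topics, alpha=None):
    """Return (dominant, floor) scores when every token is attributed to one topic."""
    if alpha is None:
        alpha = 1.0 / num_topics  # assumption: symmetric prior of 1 / num_topics
    # Each topic contributes its prior mass; the dominant topic also gets the token counts
    total = num_tokens + num_topics * alpha
    dominant = (num_tokens + alpha) / total
    floor = alpha / total
    return dominant, floor


dominant, floor = expected_topic_scores(num_tokens=6, num_topics=10)
print(round(dominant, 4), round(floor, 4))  # 0.8714 0.0143
```

Under these assumptions, the nine small scores carry no information about the document: they are the prior mass that every topic receives, and they shrink as the document gets longer.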

-----------------

## Transform topics into Features

In [21]:
def get_listScore(topic):
    '''
    Return the score of the given topic for every document in
    corpus_bow as a list, e.g. [0.1, 0.7, 0.0, 0.2].

    Note: the model returns (topic_id, score) pairs and omits
    topics whose score falls below its minimum probability,
    so look the topic up by id and use 0 when it is absent.
    '''
    scores = []
    for bow in corpus_bow:
        # Map topic_id -> score for this document
        topic_scores = dict(lda_model_tfidf[bow])
        scores.append(topic_scores.get(topic, 0))
    return scores

In [22]:
# Create empty dictionary_scores
dictionary_scores = {}

# Number of topics; must match num_topics used to train the LDA models
numTopics = 10

for topic in range(numTopics):
    key = 'Topic ' + str(topic)
    dictionary_scores[key] = get_listScore(topic)

In [23]:
# Convert dictionary_scores to a DataFrame with one column per topic
topic_features = pd.DataFrame(dictionary_scores)
topic_features.head()

|   | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.016667 | 0.016667 | 0.601591 | 0.016667 | 0.265269 | 0.016667 | 0.016668 | 0.016668 | 0.016668 | 0.016668 |
| 1 | 0.699907 | 0.033349 | 0.03334 | 0.033338 | 0.033356 | 0.033345 | 0.033342 | 0.033342 | 0.033339 | 0.033343 |
| 2 | 0.774977 | 0.025003 | 0.025002 | 0.025 | 0.025001 | 0.025 | 0.025009 | 0.025003 | 0.025001 | 0.025003 |
| 3 | 0.02501 | 0.025006 | 0.025004 | 0.025001 | 0.025005 | 0.025007 | 0.774959 | 0.025002 | 0.025002 | 0.025005 |
| 4 | 0.887463 | 0.012505 | 0.012502 | 0.012511 | 0.012505 | 0.012502 | 0.012502 | 0.012502 | 0.012503 | 0.012504 |
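
The same feature layout can also be built in a single pass over the documents instead of one pass per topic. The sketch below shows the idea on hand-written data; `topics_to_rows` is a hypothetical helper, and `sample` stands in for the per-document `(topic_id, score)` lists that the model returns.

```python
def topics_to_rows(doc_topics, num_topics):
    """Expand sparse [(topic_id, score), ...] lists into dense rows, filling gaps with 0.0."""
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics
        for topic_id, score in pairs:
            row[topic_id] = score  # place each score in the column of its topic id
        rows.append(row)
    return rows


# Hypothetical sparse output for two documents and four topics
sample = [[(0, 0.9), (2, 0.1)], [(3, 1.0)]]
columns = ['Topic ' + str(k) for k in range(4)]
rows = topics_to_rows(sample, num_topics=4)

print(columns)
print(rows)
```

Passing `rows` and `columns` to `pd.DataFrame(rows, columns=columns)` would give a frame shaped like `topic_features`, while running inference only once per document.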


### Save as topic_features.tsv

In [24]:
topic_features.to_csv('C:/Users/cherryb/Desktop/Personal Projects/Datasets/Telus - Fintech/results/topic_features.tsv', sep='\t')