# Topic Modeling
Topic modeling is an unsupervised method that combines dimensionality reduction with clustering to group text data into meaningful topics. There are several common approaches for text data. One such approach is **LSA (Latent Semantic Analysis)**. The core idea behind LSA is to take a document-term matrix (here, tweets by words) and decompose it into a document (tweet) to topic matrix and a topic to word matrix. This makes it possible to measure the similarity between documents and between words. However, there are several downsides to using LSA. One is a lack of interpretability compared to the other approaches I will discuss. LSA also needs a large set of documents for accurate results.
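To make the decomposition concrete, here is a minimal sketch of LSA on a tiny, hypothetical document-term matrix. It assumes a truncated SVD as the decomposition and uses made-up counts, so it illustrates the idea rather than anything run in this project.

```python
import numpy as np

# Hypothetical document-term counts: rows are documents, columns are terms
terms = ["peace", "mind", "power", "success"]
doc_term = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

n_topics = 2
# Singular value decomposition, keeping only the strongest n_topics components
U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
doc_topic = U[:, :n_topics] * s[:n_topics]  # documents placed in topic space
topic_term = Vt[:n_topics, :]               # topics expressed over terms

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_shared = cosine(doc_topic[0], doc_topic[1])    # documents sharing vocabulary
sim_disjoint = cosine(doc_topic[0], doc_topic[2])  # documents with no shared terms
print(round(sim_shared, 3), round(sim_disjoint, 3))
```

Documents that use the same words end up close together in topic space, while documents with disjoint vocabulary end up far apart.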

Another approach is **PLSA (Probabilistic Latent Semantic Analysis)**. PLSA adds a probabilistic spin to LSA, so each document (tweet) has a probability of belonging to each topic. However, PLSA is prone to overfitting because its number of parameters grows linearly with the number of documents.

The approach used for the project will be **LDA (Latent Dirichlet Allocation)**. LDA is essentially a Bayesian version of PLSA. The benefit of modeling with LDA is that it generalizes the best of the three approaches. This has to do with its two hyperparameters, alpha and beta, which place Dirichlet priors on the document-topic and topic-word distributions. A deeper discussion of Dirichlet distributions can be found [here](https://medium.com/@souravboss.bose/comprehensive-topic-modelling-with-nmf-lsa-plsa-lda-lda2vec-part-2-e3921e712f11). LDA is chosen for this project mainly for its ability to generalize better than the alternatives.

The entire project repo can be found [here](https://github.com/tarrantcarter/Final_Capstone). Step 1) [Web Scraping](https://github.com/tarrantcarter/Final_Capstone/blob/main/Web_Scrape_Motivational_Tweet_Generator.ipynb), Step 2) [Data Preprocessing](https://github.com/tarrantcarter/Final_Capstone/blob/main/Data_Preprocessing_Motivational_Tweet_Generator.ipynb), Step 3) [Modeling](https://github.com/tarrantcarter/Final_Capstone/blob/main/Modeling_Motivational_Tweet_Generator.ipynb), Step 4) [Optimized Model + Deployment](https://github.com/tarrantcarter/Final_Capstone/blob/main/Optimal_Deployed_Motivational_Tweet_Generator.ipynb)

In [None]:
!pip install pyLDAvis --quiet
!pip install chart_studio --quiet


In [None]:
import pandas as pd
import numpy as np
import time
import re
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import LatentDirichletAllocation
import gensim
from spacy.tokenizer import Tokenizer
import gensim.corpora as corpora
from gensim.models.ldamulticore import LdaMulticore
from pprint import pprint
from gensim.models.coherencemodel import CoherenceModel
import plotly.express as px
import pyLDAvis.gensim
import chart_studio
import chart_studio.plotly as py 
import chart_studio.tools as tls
from operator import itemgetter
from ipywidgets import interact
import tqdm
from IPython.display import display, Markdown, clear_output
# widget packages
import ipywidgets as widgets


# suppress warnings
warnings.filterwarnings("ignore")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We will load in our preprocessed tweets from our [data cleaning notebook](https://github.com/tarrantcarter/Final_Capstone/blob/main/Data_Preprocessing_Motivational_Tweet_Generator.ipynb). The csv can be found [here](https://drive.google.com/file/d/1-d-61YuocweY0F3rSnf-idL1ixbuKq4H/view?usp=sharing). 

In [None]:
# load in cleaned tweets from data cleaning notebook
tweets_cleaned = pd.read_json("/content/drive/MyDrive/Data/NLP_Capstone/motivational_tweets_cleaned.json")

In [None]:
tweets_cleaned.head()

Unnamed: 0,date,user_name,content,content_preprocessed,unigram_tokens,ngrams,ngram_tokens,nouns_only,nouns_verbs,bigrams_trigrams,bigrams_trigrams_strings
44,2021-01-17 22:13:17,LewisHowes,Know this. Everything is happening for a reaso...,know happen reason favor betterment future pai...,"[know, happen, reason, favor, betterment, futu...",know_happen_reason favor betterment future pai...,"[know_happen_reason, favor, betterment, future...","[reason, favor, betterment, future, pain, feel...","[know, reason, favor, betterment, future, pain...",[know_happen_reason],know_happen_reason
61,2021-01-15 15:28:06,LewisHowes,Protect your inner peace at all costs. Create ...,protect inner peace cost create daily practice...,"[protect, inner, peace, cost, create, daily, p...",protect inner_peace cost create daily_practice...,"[protect, inner_peace, cost, create, daily_pra...","[peace, cost, practice, communicate, stress, d...","[peace, cost, practice, communicate, stress, d...","[inner_peace, daily_practice]",inner_peace daily_practice
161,2021-01-07 16:00:29,LewisHowes,Always remember to ask for exactly what you wa...,remember ask exactly want ask love good health...,"[remember, ask, exactly, want, ask, love, good...",remember ask exactly want ask love good health...,"[remember, ask, exactly, want, ask, love, good...","[health, abundance, peace, ask, wisdom, creati...","[remember, health, abundance, peace, ask, wisd...","[health_abundance, ask_wisdom]",health_abundance ask_wisdom
274,2021-01-01 02:44:46,LewisHowes,This will be your greatest year ever. All your...,great year work start pay earn happy love deep...,"[great, year, work, start, pay, earn, happy, l...",great year work start pay earn happy love deep...,"[great, year, work, start, pay, earn, happy, l...","[year, work, start, earn, love, embrace, fear,...","[year, work, start, pay, earn, love, embrace, ...",[massive_action],massive_action
317,2020-12-30 01:38:35,LewisHowes,Be grateful for the breakdown this year. It's ...,grateful breakdown year set massive breakthrou...,"[grateful, breakdown, year, set, massive, brea...",grateful breakdown year set massive breakthrou...,"[grateful, breakdown, year, set, massive, brea...","[year, breakthrough, money, mission, get, rela...","[year, set, breakthrough, money, come, mission...",[real_friend],real_friend


In [None]:
tweets_cleaned.applymap(type)

Unnamed: 0,date,user_name,content,content_preprocessed,unigram_tokens,ngrams,ngram_tokens,nouns_only,nouns_verbs,bigrams_trigrams,bigrams_trigrams_strings
44,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>
61,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>
161,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>
274,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>
317,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>
...,...,...,...,...,...,...,...,...,...,...,...
977598,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>
977599,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>
977606,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>
977608,<class 'pandas._libs.tslibs.timestamps.Timesta...,<class 'str'>,<class 'str'>,<class 'str'>,<class 'list'>,<class 'str'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'list'>,<class 'str'>


In [None]:
tweets_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 621449 entries, 44 to 977610
Data columns (total 11 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   date                      621449 non-null  datetime64[ns]
 1   user_name                 621449 non-null  object        
 2   content                   621449 non-null  object        
 3   content_preprocessed      621449 non-null  object        
 4   unigram_tokens            621449 non-null  object        
 5   ngrams                    621449 non-null  object        
 6   ngram_tokens              621449 non-null  object        
 7   nouns_only                621449 non-null  object        
 8   nouns_verbs               621449 non-null  object        
 9   bigrams_trigrams          621449 non-null  object        
 10  bigrams_trigrams_strings  621449 non-null  object        
dtypes: datetime64[ns](1), object(10)
memory usage: 56.9+ MB


In [None]:
tweets_cleaned.shape

(621449, 11)

# Base Topic Model
In the next steps the LDA model will be created and optimized, but I haven't mentioned exactly how it will be scored. Perplexity and coherence scores will be computed initially for a base model. Our analysis will focus mostly on the coherence score rather than perplexity, because coherence tracks human intuition about topic quality more closely than perplexity does. For a further discussion of perplexity vs. coherence score, check out the link [here](http://qpleple.com/topic-coherence-to-evaluate-topic-models/).

In [None]:
# create dictionary
id2word = corpora.Dictionary(tweets_cleaned['nouns_only'])
# create texts corpus
texts = tweets_cleaned['nouns_only']
# term document frequency
corpus = [id2word.doc2bow(text) for text in texts]
# print first 30 tuples from corpus
print(corpus[:1][0][:30])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
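Each tuple in the output above is a `(token_id, count)` pair for one tweet. As a rough illustration of that encoding, here is a plain-Python sketch on hypothetical toy documents. The helper names `build_vocab` and `to_bow` are made up for this example, and the ids will not match the ones gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical toy documents, not taken from the tweet data
toy_docs = [["peace", "mind", "peace"], ["mind", "power"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bow)
```

The word order inside a tweet is discarded; only which tokens appear, and how often, is kept.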


In [None]:
# number of topics
num_topics = 10
# build LDA model
base_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

In [None]:
# extract the quoted words from each topic string
words = [re.findall(r'"([^"]*)"', t[1]) for t in base_model.print_topics()]

# create topics from the 10 most relevant words
topics = [' '.join(t[0:10]) for t in words]


# print most relevant words for each topic
# (topic_id avoids shadowing the built-in id)
for topic_id, t in enumerate(topics):
    print(f"------ Topic {topic_id} ------")
    print(t, end="\n\n")

------ Topic 0 ------
life power robert desire courage care difference question limit amp

------ Topic 1 ------
thing change einstein life albert churchill winston control end get

------ Topic 2 ------
amp dream problem feel energy fear think solve tzu obstacle

------ Topic 3 ------
mind peace moment opportunity body passion life place act define

------ Topic 4 ------
dyer life success world work person amp way failure fall

------ Topic 5 ------
look choice franklin soul henry truth life benjamin david mean

------ Topic 6 ------
day matter judge treat trust ralph life emerson hold bonaparte

------ Topic 7 ------
reality action result strength napoleon word practice abundance self power

------ Topic 8 ------
time heart john love today tomorrow bhajan anger defeat aim

------ Topic 9 ------
man way proverb create step lincoln peale experience learn jim
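The regular expression used in the cell above pulls the quoted words out of the topic strings that `print_topics()` returns. Here is a minimal sketch on a hypothetical topic string; the weights are made up for illustration and the format is assumed to be gensim's usual `weight*"word" + ...` layout.

```python
import re

# Hypothetical topic string in the weight*"word" format; weights are invented
example_topic = '0.031*"life" + 0.024*"power" + 0.019*"robert"'

# Capture everything between each pair of double quotes
example_words = re.findall(r'"([^"]*)"', example_topic)
print(' '.join(example_words))
```

Because only the quoted text is captured, the weights are dropped and the words keep the order in which they appear in the string.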



In [None]:
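# Compute Perplexity
## log_perplexity returns a per-word likelihood bound, not the perplexity itself
## perplexity is 2 ** (-bound), so a bound closer to zero means a lower (better) perplexity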
base_perplexity = base_model.log_perplexity(corpus)
print('\nPerplexity: ', base_perplexity) 

# Compute Coherence Score
coherence_model = CoherenceModel(model=base_model, texts=tweets_cleaned['nouns_only'], 
                                   dictionary=id2word, coherence='c_v')
coherence_lda_model_base = coherence_model.get_coherence()
print('\nCoherence Score: ', coherence_lda_model_base)


Perplexity:  -6.94619580004618

Coherence Score:  0.6164812244561871
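The value printed as perplexity above is negative because `log_perplexity` returns a per-word likelihood bound rather than a perplexity. As far as I know, gensim's own logging turns that bound into a perplexity estimate by raising 2 to the negative bound. The sketch below applies that conversion; the helper name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate (assumes base 2)."""
    return 2 ** (-per_word_bound)

# Per-word bound printed for the base model above
base_bound = -6.94619580004618
base_perplexity_estimate = bound_to_perplexity(base_bound)
print(f"Perplexity estimate: {base_perplexity_estimate:.1f}")
```

On this scale the base model's estimate comes out at a little over 120, and lower values are better.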


In [None]:
# topic distance visualization 
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(base_model, corpus, id2word)

This base model appears to be a great start. A coherence score of about 0.62 will need to be compared to other strategies before commenting on how good it is. As can be seen in the LDA visualization, there is nice separability between the 10 topics. Looking at the topics, some of them seem reasonable. Words like peace and mind seem like they would go together, but it isn't clear how words like life and power go together. It should be noted that many first and last names appear in the topics, but that isn't surprising since this model uses noun-only tokens.

# Optimal Topic Model
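In order to find the best model, the focus will be on tuning three parameters: number of topics, alpha, and beta. Number of topics is self-explanatory. Alpha controls document-topic density: the higher the alpha, the more topics each document is expected to contain, while a low alpha concentrates each document on a few topics. Beta controls topic-word density: the higher the beta, the more words each topic is expected to contain, while a low beta concentrates each topic on a few words. Both are searched here over values from 0.01 to 1, in addition to gensim's string presets.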
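To get a feel for what the Dirichlet concentration does, here is a small standard-library sketch. It draws symmetric Dirichlet samples by normalizing gamma draws and reports how much weight the single largest component receives on average. The helper names and the choice of 10 components are assumptions for illustration, not part of the project code.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one symmetric Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(concentration, size=10, trials=2000, seed=0):
    """Average weight of the largest component across many samples."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(concentration, size, rng)) for _ in range(trials)]
    return sum(largest) / trials

# Compare a few concentration values from the tuning grid
shares = {conc: mean_largest_share(conc) for conc in (0.01, 0.1, 1.0)}
for conc, share in shares.items():
    print(conc, round(share, 3))
```

Small concentrations put almost all of the weight on one component (a document dominated by one topic), while larger values spread the weight across more components.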

In addition to tuning these 3 parameters several different strategies will be explored as well. I will be evaluating the following approaches: no n-grams, n-grams only, nouns only n-grams, nouns and verbs only n-grams. 

The goal is to select the strategy that has the highest coherence score (metric: coherence score), the most separable topics (metric: eyeballing pyLDAvis for non-overlapping topics), and the most explanatory topics (metric: subjectively evaluating the topics). It is unlikely that all three goals will be met by one strategy, so explanatory topics will be the preferred evaluation criterion, because at the end of the day I want topics that make sense to the user.

My hunch is that the noun and verb n-grams strategy will perform best. Focusing on nouns makes it more likely the topic will be extracted from the text, and adding verbs makes sense because of the call-to-action nature of motivational tweets.


In [None]:
# supporting function: train one LDA model and return its c_v coherence
def compute_coherence_values(corpus, dictionary, k, a, b):
    
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           alpha=a,
                                           eta=b,
                                           random_state=222)
    
    # texts is the notebook-level token list defined earlier; use the dictionary that was passed in
    coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
    
    return coherence_model_lda.get_coherence()

In [None]:
# start grid search runtime
start_time = time.time() 

# Topics range (range() excludes max_topics, so this covers 3 through 9)
min_topics = 3
max_topics = 10
step_size = 1
topics_range = range(min_topics, max_topics, step_size)
# Alpha parameter
alpha = [.01,.1,.5,1,'symmetric','asymmetric']
# Beta parameter
beta = [.01,.1,.5,1,'symmetric']

# model results dict
model_results = {'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }
# Can take a long time to run; set to False to skip the grid search
run_grid_search = True
if run_grid_search:
    # one progress tick per parameter combination (7 * 6 * 5 = 210)
    pbar = tqdm.tqdm(total=len(topics_range) * len(alpha) * len(beta))

    # iterate through number of topics
    for k in topics_range:
        # iterate through alpha values
        for a in alpha:
            # iterate through beta values
            for b in beta:
                # get the coherence score for the given parameters
                cv = compute_coherence_values(corpus=corpus, dictionary=id2word, 
                                              k=k, a=a, b=b)
                # Save the model results
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)
                
                pbar.update(1)
    lda_tuning_results = pd.DataFrame(model_results)                
    lda_tuning_results.to_csv('/content/drive/MyDrive/Data/NLP_Capstone/lda_tuning_nouns_3-10_topics.csv', index=False)
    pbar.close()

# print grid search runtime
print(time.strftime('%H hours, %M minutes, %S seconds', time.gmtime(time.time() - start_time)))

100%|██████████| 210/210 [3:59:55<00:00, 68.55s/it]

03 hours, 59 minutes, 55 seconds
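The 210 iterations in the progress bar come from the size of the parameter grid. Here is a short sketch that enumerates the same grid with `itertools.product`; the variable names are chosen for this example and no model is trained.

```python
from itertools import product

topic_counts = range(3, 10)  # 3 through 9, since range() excludes the stop value
alpha_values = [0.01, 0.1, 0.5, 1, "symmetric", "asymmetric"]
beta_values = [0.01, 0.1, 0.5, 1, "symmetric"]

# Every (topics, alpha, beta) combination, in the same order as the nested loops
grid = list(product(topic_counts, alpha_values, beta_values))
print(len(grid))          # 7 * 6 * 5 combinations
print(grid[0], grid[-1])  # first and last combination
```

At roughly 69 seconds per combination, this grid accounts for the four-hour runtime reported above.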





In [None]:
lda_tuning_results.sort_values('Coherence',ascending=False).head(30)

Unnamed: 0,Topics,Alpha,Beta,Coherence
160,8,0.5,0.01,0.623141
205,9,asymmetric,0.01,0.616538
161,8,0.5,0.1,0.615413
185,9,0.1,0.01,0.614666
206,9,asymmetric,0.1,0.614617
191,9,0.5,0.1,0.613899
176,8,asymmetric,0.1,0.612633
190,9,0.5,0.01,0.611294
179,8,asymmetric,symmetric,0.611154
164,8,0.5,symmetric,0.610316
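One way to read the table above is to keep only the best configuration for each topic count. Below is a plain-Python sketch that uses the ten rows shown; in the notebook itself the same summary could be produced from `lda_tuning_results` with pandas.

```python
# (topics, alpha, beta, coherence) rows copied from the sorted output above
top_rows = [
    (8, 0.5, 0.01, 0.623141),
    (9, "asymmetric", 0.01, 0.616538),
    (8, 0.5, 0.1, 0.615413),
    (9, 0.1, 0.01, 0.614666),
    (9, "asymmetric", 0.1, 0.614617),
    (9, 0.5, 0.1, 0.613899),
    (8, "asymmetric", 0.1, 0.612633),
    (9, 0.5, 0.01, 0.611294),
    (8, "asymmetric", "symmetric", 0.611154),
    (8, 0.5, "symmetric", 0.610316),
]

# Keep the highest-coherence configuration for each topic count
best_by_topics = {}
for n_topics, a, b, cv in top_rows:
    current = best_by_topics.get(n_topics)
    if current is None or cv > current[2]:
        best_by_topics[n_topics] = (a, b, cv)

for n_topics in sorted(best_by_topics):
    print(n_topics, best_by_topics[n_topics])
```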


It seems that 8-9 topics are the sweet spot for these particular parameters. I will withhold judgement on what these results mean until all the other strategies have been attempted. These results will be compared to other strategies in the final step of the project, the [optimized model deployment notebook](https://github.com/tarrantcarter/Final_Capstone/blob/main/Optimal_Deployed_Motivational_Tweet_Generator.ipynb).