# Topic Modeling via Latent Dirichlet Allocation (LDA)

We model topics for our dataset using LDA, selecting hyperparameters for the model using the ```c_v``` coherence metric.

See [this paper](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf) for a definition of the various coherence metrics implemented in Gensim's [```CoherenceModel```](https://radimrehurek.com/gensim/models/coherencemodel.html).

In [2]:
import pandas as pd
import sys
sys.path.append("..")

from util.model import LDATuner
from itertools import product
import pickle

%load_ext autoreload
%autoreload 2

In [3]:
df = pd.read_pickle("../data/avatar_fics_processed.pickle")
docs = df["processed"]

We use our ```LDATuner``` class from [```util/model.py```](../util/model.py). We tune three parameters: ```num_topics```, the number of topics in the model; ```min_thresh```, the minimum number of documents a word must appear in to be included in the dictionary; and ```max_thresh```, the maximum fraction of documents a word can appear in before it is excluded from the dictionary.
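To make the two thresholds concrete, here is a minimal pure-Python sketch of document-frequency filtering. It assumes ```min_thresh``` is an absolute document count and ```max_thresh``` is a fraction of the corpus, which is consistent with the values used in this notebook (e.g. 25 and 0.3) and mirrors the ```no_below``` / ```no_above``` arguments of Gensim's ```Dictionary.filter_extremes```. Whether ```LDATuner``` does exactly this internally is an assumption; the function name ```filter_vocab``` and the toy documents are illustrative only.

```python
from collections import Counter


def filter_vocab(docs, min_thresh, max_thresh):
    """Return the set of words kept after document-frequency filtering.

    Assumed semantics (not confirmed against LDATuner): keep a word if it
    appears in at least `min_thresh` documents and in no more than
    `max_thresh` (a fraction) of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    n_docs = len(docs)
    return {
        word
        for word, df in doc_freq.items()
        if df >= min_thresh and df / n_docs <= max_thresh
    }


# Toy corpus of tokenized documents (made up for illustration).
toy_docs = [
    ["zuko", "fire", "tea"],
    ["zuko", "katara", "water"],
    ["zuko", "fire", "katara"],
    ["zuko", "aang", "air"],
]
kept = filter_vocab(toy_docs, min_thresh=2, max_thresh=0.5)
print(sorted(kept))  # ['fire', 'katara']
```

In this toy corpus, "zuko" appears in every document and is dropped by the upper threshold, while words that appear only once are dropped by the lower one.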

We tested ```alpha``` and ```beta``` outside this notebook and concluded that ```asymmetric``` and ```symmetric```, respectively, were the best options, so we hold them fixed here.

In [5]:
tune = LDATuner(docs, verbose=False)
keywords=["num_topics", "min_thresh", "max_thresh", "alpha", "beta"]

Let's fix the minimum and maximum thresholds and see which number of topics seems best.

Because the thresholds are fixed, the result won't necessarily be the global optimum, but it gives us an approximation and tells us where to run a finer-grained search.
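The search cells in this notebook build their grid with ```itertools.product``` and pass it to ```tune_hyper``` together with ```keywords```. As a rough illustration of how the two fit together, the sketch below pairs each combination with the keyword names. The helper ```grid_as_dicts``` is hypothetical and not part of ```LDATuner```, which may handle this differently.

```python
from itertools import product


def grid_as_dicts(keywords, *value_lists):
    """Expand lists of candidate values into one dict per combination.

    Hypothetical helper for illustration only.
    """
    return [dict(zip(keywords, combo)) for combo in product(*value_lists)]


keywords = ["num_topics", "min_thresh", "max_thresh", "alpha", "beta"]
grid = grid_as_dicts(
    keywords,
    [8, 9, 10],       # num_topics
    [25],             # min_thresh
    [0.3],            # max_thresh
    ["asymmetric"],   # alpha
    ["symmetric"],    # beta
)
print(len(grid))
print(grid[0])
```

Note that ```product``` returns a one-shot iterator: once it has been consumed, it yields nothing further, so a grid has to be rebuilt (or stored as a list) before it can be reused.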

In [12]:
num_topics = [8, 9, 10, 11, 12, 13, 14]
min_thresh = [25]
max_thresh = [0.3]

alpha=['asymmetric']
beta=['symmetric']

count = 0

iterations = 400
passes = 20
 
keywords=["num_topics", "min_thresh", "max_thresh", "alpha", "beta"]
param_grid = product(num_topics, min_thresh, max_thresh, alpha, beta)

model_scores = tune.tune_hyper(keywords, param_grid)

beginning parameter evaluation...
trained model 1 with params {'num_topics': 8, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.39683902867471493
trained model 2 with params {'num_topics': 9, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.3962585887599385
trained model 3 with params {'num_topics': 10, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.3918289076672109
trained model 4 with params {'num_topics': 11, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.4126942433088722
trained model 5 with params {'num_topics': 12, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.411716796986831
trained model 6 with params {'num_topics': 13, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.396659267829435
trained model 7 with params

In [14]:
num_topics = [15, 16, 17, 18, 19, 20]
min_thresh = [25]
max_thresh = [0.3]

alpha=['asymmetric']
beta=['symmetric']

count = 0

iterations = 400
passes = 20
 
keywords=["num_topics", "min_thresh", "max_thresh", "alpha", "beta"]
param_grid = product(num_topics, min_thresh, max_thresh, alpha, beta)

model_scores = tune.tune_hyper(keywords, param_grid)

beginning parameter evaluation...
trained model 1 with params {'num_topics': 15, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.3967814753133002
trained model 2 with params {'num_topics': 16, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.40407837280180503
trained model 3 with params {'num_topics': 17, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.4016013520084576
trained model 4 with params {'num_topics': 18, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.3988050328218545
trained model 5 with params {'num_topics': 19, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.3987175396452929
trained model 6 with params {'num_topics': 20, 'min_thresh': 25, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
coherence 0.3830554628413303


We get the best results with 11 and 12 topics. Let's do some more fine-grained searching in the 11- and 12-topic range, this time focusing our efforts on tuning the dictionary inclusion thresholds.

In [10]:
num_topics = [11, 12]
min_thresh = [20, 30, 40, 50]
max_thresh = [0.25, 0.3, 0.35]
alpha=['asymmetric']
beta=['symmetric']

iterations = 400
passes = 20
 
keywords=["num_topics", "min_thresh", "max_thresh", "alpha", "beta"]
param_grid = product(num_topics, min_thresh, max_thresh, alpha, beta)

model_scores = tune.tune_hyper(keywords, param_grid)

beginning parameter evaluation...
trained model 1 with params (11, 20, 0.25, 'asymmetric', 'symmetric')
coherence 0.41789481019747027
trained model 2 with params (11, 20, 0.3, 'asymmetric', 'symmetric')
coherence 0.3580443453232482
trained model 3 with params (11, 20, 0.35, 'asymmetric', 'symmetric')
coherence 0.3867188910071155
trained model 4 with params (11, 30, 0.25, 'asymmetric', 'symmetric')
coherence 0.3968217189622278
trained model 5 with params (11, 30, 0.3, 'asymmetric', 'symmetric')
coherence 0.3987019767949158
trained model 6 with params (11, 30, 0.35, 'asymmetric', 'symmetric')
coherence 0.382660823021126
trained model 7 with params (11, 40, 0.25, 'asymmetric', 'symmetric')
coherence 0.3759876930648349
trained model 8 with params (11, 40, 0.3, 'asymmetric', 'symmetric')
coherence 0.36952650504187884
trained model 9 with params (11, 40, 0.35, 'asymmetric', 'symmetric')
coherence 0.3754049225982536
trained model 10 with params (11, 50, 0.25, 'asymmetric', 'symmetric')
cohere

In [6]:
#pickle.dump(model_scores, open("../models/avatar_model_scores.pickle", "wb" ) )
with open("../models/avatar_model_scores.pickle", "rb") as f:
    model_scores = pickle.load(f)

We've found several good solutions (by ```c_v``` coherence) among the parameter combinations we evaluated. Let's train a few of the best, then evaluate them by comparing the topics to our domain knowledge.

We don't automatically select the model with the highest coherence, because the metric does not always agree with what a human would judge to be a good topic model. The paper linked above evaluates coherence metrics by their correlation with human ratings; ```c_v``` was the highest at 0.731, which is strong but still well short of perfect agreement.
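One way to put this into practice is to shortlist the top few candidates by coherence and review each by hand, rather than taking only the single highest score. The sketch below assumes the scores are available as ```(params, coherence)``` pairs. The actual structure of ```model_scores``` returned by ```tune_hyper``` is not shown in this notebook, so both that layout and the ```shortlist``` helper are assumptions. The values are rounded from the grid-search output above.

```python
def shortlist(scores, k=4):
    """Return the k highest-coherence (params, coherence) pairs, best first."""
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Coherence values rounded from the search output; the layout is assumed.
example_scores = [
    ((11, 20, 0.25, "asymmetric", "symmetric"), 0.4179),
    ((11, 20, 0.3, "asymmetric", "symmetric"), 0.3580),
    ((11, 20, 0.35, "asymmetric", "symmetric"), 0.3867),
    ((11, 30, 0.25, "asymmetric", "symmetric"), 0.3968),
    ((11, 30, 0.3, "asymmetric", "symmetric"), 0.3987),
]
for params, coherence in shortlist(example_scores, k=3):
    print(params, coherence)
```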

In [8]:
model_info_1 = tune.train_model_params((11, 50, 0.25, 'asymmetric', 'symmetric'), keywords)
model_info_2 = tune.train_model_params((11, 50, 0.3, 'asymmetric', 'symmetric'), keywords)
model_info_3 = tune.train_model_params((12, 50, 0.3, 'asymmetric', 'symmetric'), keywords)
model_info_4 = tune.train_model_params((12, 50, 0.35, 'asymmetric', 'symmetric'), keywords)

training model with params {'num_topics': 11, 'min_thresh': 50, 'max_thresh': 0.25, 'alpha': 'asymmetric', 'beta': 'symmetric'}
training model with params {'num_topics': 11, 'min_thresh': 50, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
training model with params {'num_topics': 12, 'min_thresh': 50, 'max_thresh': 0.3, 'alpha': 'asymmetric', 'beta': 'symmetric'}
training model with params {'num_topics': 12, 'min_thresh': 50, 'max_thresh': 0.35, 'alpha': 'asymmetric', 'beta': 'symmetric'}


What do our topics look like?

We print the topics for each of the top four models, then evaluate them by eye. We omit the detailed reasoning, as it relies on domain knowledge.
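The raw ```show_topic``` output is a list of ```(word, probability)``` pairs, which can be hard to scan when comparing many topics. A small formatting helper along the following lines might make the inspection easier. The name ```top_words``` is hypothetical, and the sample pairs are the first few entries of one topic printed in this notebook, with probabilities rounded.

```python
def top_words(topic, n=10):
    """Format a topic given as (word, probability) pairs into one line."""
    ranked = sorted(topic, key=lambda pair: pair[1], reverse=True)
    return ", ".join(word for word, _ in ranked[:n])


# A few (word, probability) pairs in the shape returned by show_topic.
sample_topic = [
    ("yue", 0.0100),
    ("fish", 0.0033),
    ("chief", 0.0032),
    ("ocean", 0.0031),
]
print(top_words(sample_topic, n=3))  # yue, fish, chief
```

With a trained model, this could be called as ```top_words(mod_2.show_topic(i, topn=20))``` inside the loop over topics.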

In [17]:
mod_1 = model_info_1["model"]
for i in range(mod_1.num_topics):
    print(mod_1.show_topic(i, topn=20), "\n")

[('cell', 0.0052065845), ('waterbender', 0.0051848777), ('element', 0.0049300576), ('zhao', 0.0046798284), ('destroy', 0.003947909), ('destiny', 0.0038004091), ('chapter', 0.003639634), ('waterbend', 0.0036186203), ('create', 0.003576232), ('kyoshi', 0.003508324), ('earthbend', 0.003464028), ('earthbender', 0.0034163648), ('prisoner', 0.003387849), ('momo', 0.003169153), ('betray', 0.0030530854), ('blast', 0.00303634), ('king', 0.0028455232), ('tent', 0.0027666674), ('island', 0.0027584834), ('powerful', 0.0027563642)] 

[('car', 0.00731236), ('boyfriend', 0.005289455), ('book', 0.004577607), ('party', 0.0036964887), ('glass', 0.003665543), ('school', 0.0036544644), ('god', 0.0035582413), ('cute', 0.0034575143), ('cat', 0.0031686102), ('fucking', 0.0031381347), ('music', 0.003077253), ('couch', 0.0029087986), ('bar', 0.0028817586), ('wanna', 0.0027396043), ('marry', 0.002635631), ('asshole', 0.002618692), ('totally', 0.002530348), ('dude', 0.0025076896), ('band', 0.0024658162), ('chapt

Our best model by coherence actually looks fairly bad: based on domain knowledge, several distinct themes seem to be mixed together within individual topics.

In [15]:
mod_2 = model_info_2["model"]
for i in range(mod_2.num_topics):
    print(mod_2.show_topic(i, topn=20), "\n")

[('yue', 0.009995302), ('fish', 0.0033362852), ('chief', 0.0031552024), ('ocean', 0.0031141932), ('snow', 0.0029772522), ('sand', 0.0029670584), ('tent', 0.002942843), ('waterbender', 0.0027138009), ('bender', 0.002672898), ('glow', 0.0024474605), ('beach', 0.0023836163), ('sea', 0.0023337086), ('village', 0.0022296438), ('fur', 0.002123047), ('river', 0.002066429), ('northern', 0.002035358), ('boat', 0.0019597064), ('human', 0.0019594538), ('metal', 0.0019322705), ('path', 0.0019311558)] 

[('wound', 0.0028220515), ('cough', 0.0026569832), ('thumb', 0.0026519957), ('flower', 0.0025177717), ('lung', 0.0024289712), ('bandage', 0.0024243663), ('scratch', 0.0024155288), ('sweat', 0.0023694513), ('bowl', 0.0023646413), ('mask', 0.0023434875), ('breathing', 0.00231163), ('pink', 0.0022579492), ('blade', 0.00219806), ('rib', 0.002145332), ('tighten', 0.0021418969), ('injury', 0.0021395227), ('pillow', 0.0021296847), ('bone', 0.0020614164), ('heartbeat', 0.0020598187), ('hip', 0.0020574732)] 

Model 2 seems to have fairly well-grouped topics!

In [17]:
mod_3 = model_info_3["model"]
for i in range(mod_3.num_topics):
    print(mod_3.show_topic(i, topn=20), "\n")

[('yue', 0.009530093), ('fish', 0.0032167402), ('chief', 0.0031647927), ('sand', 0.0030818277), ('ocean', 0.0030097922), ('tent', 0.0029554453), ('snow', 0.0029227654), ('bender', 0.0026799412), ('waterbender', 0.0026044806), ('beach', 0.002561594), ('glow', 0.0024072512), ('sea', 0.0024047513), ('village', 0.002248884), ('fur', 0.0021251645), ('boat', 0.0019733098), ('waterbend', 0.0019544319), ('river', 0.0019527501), ('northern', 0.001900886), ('path', 0.0018778432), ('human', 0.0018557592)] 

[('blue spirit', 0.0039816294), ('mask', 0.0038722036), ('wound', 0.003157231), ('bandage', 0.0027920718), ('bowl', 0.0027117515), ('blade', 0.0026929288), ('breathing', 0.002566681), ('nightmare', 0.0024728186), ('injury', 0.0024477127), ('scratch', 0.0023896857), ('lung', 0.0023690986), ('sweat', 0.0022897415), ('cough', 0.0022518584), ('rib', 0.0022041749), ('heartbeat', 0.0021947229), ('thumb', 0.0021540388), ('spar', 0.0021384323), ('tighten', 0.0019627062), ('energy', 0.0019541818), ('pi

In [16]:
mod_4 = model_info_4["model"]
for i in range(mod_4.num_topics):
    print(mod_4.show_topic(i, topn=20), "\n")

[('ty lee', 0.03639176), ('letter', 0.006846755), ('marry', 0.00609534), ('princess', 0.0045019416), ('servant', 0.00364631), ('ambassador', 0.003522113), ('garden', 0.0033710748), ('throne', 0.0033708117), ('royal', 0.0033680575), ('lady', 0.0033506474), ('chief', 0.0032625394), ('relationship', 0.0032096137), ('baby', 0.0030548053), ('king', 0.0029774983), ('parent', 0.002895992), ('advisor', 0.002826797), ('bumi', 0.0028143683), ('daughter', 0.002812721), ('party', 0.0027870736), ('southern', 0.0027830324)] 

[('hakoda', 0.043646127), ('lu ten', 0.009712686), ('dragon', 0.00806481), ('ursa', 0.0073421425), ('princess', 0.006787869), ('azulon', 0.00612223), ('bato', 0.005404968), ('chief', 0.005386476), ('feature', 0.0053089866), ('white lotus', 0.0046216925), ('grandfather', 0.004573473), ('royal', 0.004235033), ('harbor', 0.0041295635), ('ship', 0.0039406572), ('southern', 0.0039030924), ('pakku', 0.0038750982), ('village', 0.0037707614), ('waterbender', 0.0034070606), ('leader', 0

Models 3 and 4 are slightly worse than model 2, subjectively. 

We choose model 2 as our final model. We'll analyze the model and cluster the data points in [```4_cluster.ipynb```](4_cluster.ipynb).

In [18]:
model_info = model_info_2
with open("../models/avatar_model.pickle", "wb") as f:
    pickle.dump(model_info, f)