# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach - learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but we know they're there by reading them.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
Primary use case: what the hell are your documents about? Who might want to know that in industry - 
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* Part 0: Warm-Up
* Part 1: Describe how an LDA Model works
* Part 2: Estimate a LDA Model with Gensim
* Part 3: Interpret LDA results & Select the appropriate number of topics

# Part 0: Warm-Up
How do we do a grid search? 

In [62]:
import warnings
warnings.filterwarnings("ignore")

In [63]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [64]:
# Load training data
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'))

# Load testing data
newsgroups_test = fetch_20newsgroups(subset='test', 
                                     remove=('headers', 'footers', 'quotes'))

print(f'Training Samples: {len(newsgroups_train.data)}')
print(f'Testing Samples: {len(newsgroups_test.data)}')

Training Samples: 11314
Testing Samples: 7532


In [65]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [66]:
newsgroups_train['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [67]:
newsgroups_train['data'][1000]

"Anybody seen mouse cursor distortion running the Diamond 1024x768x256 driver?\nSorry, don't know the version of the driver (no indication in the menus) but it's a recently\ndelivered Gateway system.  Am going to try the latest drivers from Diamond BBS but wondered\nif anyone else had seen this.\n\npost or email"

### GridSearch on Just Classifier
* Fit the vectorizer and prepare BEFORE it goes into the gridsearch

In [68]:
# Instantiate vectorizer
vect = TfidfVectorizer()

# Transform the training data
X_train = vect.fit_transform(newsgroups_train['data'])
print(X_train.shape)

(11314, 101631)


In [69]:
params_1 = {
    'min_samples_leaf': [1, 2, 5, 10]
}

# Instantiate classifier
clf = RandomForestClassifier()

# GridSearch
gs1 = GridSearchCV(clf, params_1, cv=5, n_jobs=-1, verbose=1)
gs1.fit(X_train, newsgroups_train['target'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:  1.2min remaining:    8.1s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.2min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [70]:
gs1.best_score_

0.6557360498512768

In [71]:
gs1.best_params_

{'min_samples_leaf': 2}

In [72]:
test_sample = vect.transform(["The new york yankees are the best team in the region."])
test_sample.shape

(1, 101631)

In [73]:
gs1.predict(test_sample)[0]

10

In [74]:
newsgroups_train['target_names'][9]

'rec.sport.baseball'

### GridSearch with BOTH the Vectoizer & Classifier

In [75]:
from sklearn.pipeline import Pipeline

# 1. Create a pipeline with a vectorize and a classifier
# 2. Use Grid Search to optimize the entire pipeline
pipe = Pipeline([
    ('vect',TfidfVectorizer()),
    ('clf',RandomForestClassifier(random_state=42))
])

params_2 = {
    'vect__stop_words': (None,'english'),
    'vect__min_df': (2,5),
    'clf__max_depth': (10, None)
}

gs2 = GridSearchCV(pipe, params_2, cv=5, n_jobs=-1, verbose=1)
gs2.fit(newsgroups_train['data'], newsgroups_train['target'])

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.2min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [76]:
gs2.best_score_

0.6607746264533867

In [77]:
gs2.best_params_

{'clf__max_depth': None, 'vect__min_df': 2, 'vect__stop_words': 'english'}

In [78]:
pred = gs2.predict(["The new york yankees are the best team in the region."])
pred

array([9])

In [79]:
newsgroups_train['target_names'][pred[0]]

'rec.sport.baseball'

Advantages to using GS with the Pipe:
* Allows us to make predictions on raw text increasing reproducibility. :)
* Allows us to tune the parameters of the vectorizer along side the classifier. :D 

# Part 1: Describe how an LDA Model works

[Your Guide to Latent Dirichlet Allocation](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d)

[LDA Topic Modeling](https://lettier.com/projects/lda-topic-modeling/)

[Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)

In [21]:
# Download spacy model
import spacy.cli
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [22]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

In [80]:
df = pd.DataFrame({
    'content': newsgroups_train['data'],
    'target': newsgroups_train['target'],
    'target_names': [newsgroups_train['target_names'][i] for i in newsgroups_train['target']]
})
print(df.shape)

(11314, 3)


In [81]:
pd.set_option('display.max_colwidth', 0)
df.sample(3)

Unnamed: 0,content,target,target_names
2235,\nMaybe...then again did you get rid of that H/D of yorn and buy a rice rocket \nof your own? That would certainly explain the friendliness...unless you \nmaybe had a piece of toilet paper stuck on the bottom of your boot...8-).\n\nRich\n,8,rec.motorcycles
804,It worked!!!\nThank you very much!\n,3,comp.sys.ibm.pc.hardware
9779,"\nJim, please, that's a lame explanation of the trinity that Jesus provides\nabove. Baptizing people in the name of three things != trinity. If\nthis is the case, then I'm wrong, I assumed that trinity implies that\nGod is three entities, and yet the same.\n\nCheers,\nKent",19,talk.religion.misc


In [82]:
# For reference on regex: https://docs.python.org/3/library/re.html

# From 'content' column: 

# 1. Remove new line characters
df['clean_text'] = df['content'].apply(lambda x: re.sub('\s+', ' ', x))
# 2. Remove extra whitespace 
df['clean_text'] = df['clean_text'].apply(lambda x: ' '.join(x.split()))
# 3. Remove Emails
df['clean_text'] = df['clean_text'].apply(lambda x: re.sub('From: \S+@\S+', '', x))
# 4. Remove non-alphanumeric characters
df['clean_text'] = df['clean_text'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))


In [83]:
df.sample(3)

Unnamed: 0,content,target,target_names,clean_text
8535,"\nYou could take my wrongly spelled surname :-).\n\nCheers,\nKent Sandvik",0,alt.atheism,You could take my wrongly spelled surname Cheers Kent Sandvik
5114,"Last week I asked for help in getting an old homemade amp working with\nmy Sun CD-ROM drive. It turns out that the channel I was testing with\nwas burned out in the amp. The other channel works fine.\n\nSo now I need a new amplifier chip. My local Radio Shack no longer\ncarries components! The chip is a 12 pin SIP (?) labelled with BA5406\nand then ""502 515"" below that.\n\nDoes anyone have a source? Thanks,",12,sci.electronics,Last week I asked for help in getting an old homemade amp working with my Sun CD ROM drive It turns out that the channel I was testing with was burned out in the amp The other channel works fine So now I need a new amplifier chip My local Radio Shack no longer carries components The chip is a pin SIP labelled with BA and then below that Does anyone have a source Thanks
5913,"\n\nThe theory is that the hollering kills the spirit of the criminal/Nazi \nArmenians of the ASALA/SDPA/ARF Terrorism and Revisionism Triangle. \nNow, try dealing with the rest of what I wrote.\n\nWhat is more, the activities of the Armenian Government seem to have been\nefforts aimed at eradicating a race (the Turks) or aimed at carrying out a\none-sided feud, instead of being a struggle for liberation. From the outset,\nthe efforts of the Armenian revolutionaries within the Ottoman borders took\nthe form of terrorist and destructive actions aimed at mass murder, cruelty\nand genocide, so that no other interpretation of them is possible. Armenian\nactivities started during the reign of Abdulhamid II as individual acts of\nterror, and then developed into assassinations and surprise attacks. The element\nof brute force in these activities increased steadily, culminating in mass\nrebellions and widespread fighting during the First World War. Furthermore,\nwhen the Ottoman army withdrew from Eastern Anatolia after the 1915 Sarikamis\ndefeat, Armenian revolutionaries initiated a series of cruelties in this area.\nAlthough the Russians occupied Eastern Anatolia as an enemy, nevertheless they\nwere constrained by the rules of war. However, when they returned to their\ncountry in 1917 after the Revolution, Armenian revolutionaries were unchecked\nin this area for about a year until the Ottoman forces returned to Erzurum\nin 1918. During this period, Armenian revolutionaries executed massacres on\nthe local people which is recorded in historical documents.[1]\n\nFor example, let us look at a report dated 21 March 1918 which the Commander\nof the Third Army submitted when he entered Erzurum and Erzincan: \n\n ""They were completely and systematically destroyed and burned down \n by Armenians, even the trees were cut down, and they are like a \n building entirely consumed by fire in every sense of the word."" \n\nAs for the people who had been living in Erzurum and Erzincan:\n\n""Those who were capable of fighting were taken away at the very beginning\n with the excuse of forced labor in road construction, they were taken\n in the direction of Sarikamis and annihilated. When the Russian army\n withdrew, a part of the remaining people was destroyed in Armenian\n massacres and cruelties: they were thrown into wells, they were locked\n in houses and burned down, they were killed with bayonets and swords, in places\n selected as butchering spots, their bellies were torn open, their lungs\n were pulled out, and girls and women were hanged by their hair after\n being subjected to every conceivable abominable act. A very small part \n of the people who were spared these abominations far worse than the\n cruelty of the inquisition resembled living dead and were suffering\n from temporary insanity because of the dire poverty they had lived\n in and because of the frightful experiences they had been subjected to.\n Including women and children, such persons discovered so far do not\n exceed one thousand five hundred in Erzincan and thirty thousand in\n Erzurum. All the fields in Erzincan and Erzurum are untilled, everything\n that the people had has been taken away from them, and we found them\n in a destitute situation. At the present time, the people are subsisting\n on some food they obtained, impelled by starvation, from Russian storages\n left behind after their occupation of this area.""[2]\n \nForeign observers who witnessed the events, including Russian Officers\nwho did not desert their lines, submitted detailed reports proving the\ngenocide to Ottoman commanders who received them as prisoners of war.\nWhat is most important is that they stated in their reports 'the \nmassacres did not happen by chance but were planned.'[3]\n\nAt the end of the war, the German author Dr. Weiss, his Austrian colleague\nDr. Stein and his Turkish colleague Mr. Ahmet Vefik visited Trabzon, Kars,\nErzurum and Batum between April 17th and May 20th 1918 to record the\ncruelties. Their writings not only show the scope of Armenian activities,\nbut also reveal their goal and true nature.[4]\n\n[1] (The Ottoman State, the Ministry of War), ""Islam Ahalinin Ducar Olduklari\n Mezalim Hakkinda Vesaike Mustenid Malumat,"" (Istanbul, 1918). The French\n version: ""Documents Relatifs aux Atrocites Commises par les Armeniens sur\n la Population Musulmane,"" (Istanbul, 1919). In the Latin script: H. K.\n Turkozu, ed., ""Osmanli ve Sovyet Belgeleriyle Ermeni Mezalimi,"" (Ankara,\n 1982). In addition: Z. Basar, ed., ""Ermenilerden Gorduklerimiz,"" (Ankara,\n 1974) and, edited by the same author, ""Ermeniler Hakkinda Makaleler -\n Derlemeler,"" (Ankara, 1978). ""Askeri Tarih Belgeleri ...,"" Vol. 32, 83\n (December 1983), document numbered 1881.\n[2] ""Askeri Tarih Belgeleri ....,"" Vol. 31, 81 (December 1982), document\n numbered 1869.\n[3] From Twerdo-Khlebof's report dated 29 April 1918; quoted in Ermeniler ...,\n Vol. 2, p. 275.\n[4] A. R. (Altinay), ""Iki Komite - Iki Kital,"" (Istanbul, 1919), and, ""Kafkas\n Yollarinda Hatiralar ve Tahassusler"" (Istanbul, 1919).\n\n\nSerdar Argic",17,talk.politics.mideast,The theory is that the hollering kills the spirit of the criminal Nazi Armenians of the ASALA SDPA ARF Terrorism and Revisionism Triangle Now try dealing with the rest of what I wrote What is more the activities of the Armenian Government seem to have been efforts aimed at eradicating a race the Turks or aimed at carrying out a one sided feud instead of being a struggle for liberation From the outset the efforts of the Armenian revolutionaries within the Ottoman borders took the form of terrorist and destructive actions aimed at mass murder cruelty and genocide so that no other interpretation of them is possible Armenian activities started during the reign of Abdulhamid II as individual acts of terror and then developed into assassinations and surprise attacks The element of brute force in these activities increased steadily culminating in mass rebellions and widespread fighting during the First World War Furthermore when the Ottoman army withdrew from Eastern Anatolia after the Sarikamis defeat Armenian revolutionaries initiated a series of cruelties in this area Although the Russians occupied Eastern Anatolia as an enemy nevertheless they were constrained by the rules of war However when they returned to their country in after the Revolution Armenian revolutionaries were unchecked in this area for about a year until the Ottoman forces returned to Erzurum in During this period Armenian revolutionaries executed massacres on the local people which is recorded in historical documents For example let us look at a report dated March which the Commander of the Third Army submitted when he entered Erzurum and Erzincan They were completely and systematically destroyed and burned down by Armenians even the trees were cut down and they are like a building entirely consumed by fire in every sense of the word As for the people who had been living in Erzurum and Erzincan Those who were capable of fighting were taken away at the very beginning with the excuse of forced labor in road construction they were taken in the direction of Sarikamis and annihilated When the Russian army withdrew a part of the remaining people was destroyed in Armenian massacres and cruelties they were thrown into wells they were locked in houses and burned down they were killed with bayonets and swords in places selected as butchering spots their bellies were torn open their lungs were pulled out and girls and women were hanged by their hair after being subjected to every conceivable abominable act A very small part of the people who were spared these abominations far worse than the cruelty of the inquisition resembled living dead and were suffering from temporary insanity because of the dire poverty they had lived in and because of the frightful experiences they had been subjected to Including women and children such persons discovered so far do not exceed one thousand five hundred in Erzincan and thirty thousand in Erzurum All the fields in Erzincan and Erzurum are untilled everything that the people had has been taken away from them and we found them in a destitute situation At the present time the people are subsisting on some food they obtained impelled by starvation from Russian storages left behind after their occupation of this area Foreign observers who witnessed the events including Russian Officers who did not desert their lines submitted detailed reports proving the genocide to Ottoman commanders who received them as prisoners of war What is most important is that they stated in their reports the massacres did not happen by chance but were planned At the end of the war the German author Dr Weiss his Austrian colleague Dr Stein and his Turkish colleague Mr Ahmet Vefik visited Trabzon Kars Erzurum and Batum between April th and May th to record the cruelties Their writings not only show the scope of Armenian activities but also reveal their goal and true nature The Ottoman State the Ministry of War Islam Ahalinin Ducar Olduklari Mezalim Hakkinda Vesaike Mustenid Malumat Istanbul The French version Documents Relatifs aux Atrocites Commises par les Armeniens sur la Population Musulmane Istanbul In the Latin script H K Turkozu ed Osmanli ve Sovyet Belgeleriyle Ermeni Mezalimi Ankara In addition Z Basar ed Ermenilerden Gorduklerimiz Ankara and edited by the same author Ermeniler Hakkinda Makaleler Derlemeler Ankara Askeri Tarih Belgeleri Vol December document numbered Askeri Tarih Belgeleri Vol December document numbered From Twerdo Khlebof s report dated April quoted in Ermeniler Vol p A R Altinay Iki Komite Iki Kital Istanbul and Kafkas Yollarinda Hatiralar ve Tahassusler Istanbul Serdar Argic


In [84]:
nlp = spacy.load("en_core_web_lg")

In [85]:
# Leverage tqdm for progress_apply
from tqdm import tqdm
tqdm.pandas()

# If you're on macOS, Linux, or python session executed from Windows Subsystem for Linux (WSL)
# conda activate U4-S1-NLP
# pip install pandarallel
#
# from pandarallel import pandarallel
# pandarallel.initialize(progress_bar=True)
#
# df['lemmas'] = df['content'].parallel_apply(get_lemmas)
#
# Ref: https://github.com/nalepae/pandarallel

In [86]:
# Create 'lemmas' column
def get_lemmas(x):
    lemmas = []
    for token in nlp(x):
        if (token.is_stop!=True) and (token.is_punct!=True):
            lemmas.append(token.lemma_)
    return lemmas

df['lemmas'] = df['clean_text'].progress_apply(get_lemmas)

100%|████████████████████████████████████████████████████████████████████████████| 11314/11314 [04:47<00:00, 39.32it/s]


In [87]:
df.head()

Unnamed: 0,content,target,target_names,clean_text,lemmas
0,"I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.",7,rec.autos,I was wondering if anyone out there could enlighten me on this car I saw the other day It was a door sports car looked to be from the late s early s It was called a Bricklin The doors were really small In addition the front bumper was separate from the rest of the body This is all I know If anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail,"[wonder, enlighten, car, see, day, , , door, sport, car, , look, late, , s, , early, , s, , call, Bricklin, , door, small, , addition, , bumper, separate, rest, body, , know, , tellme, model, , engine, spec, , year, production, , car, , history, , info, funky, look, car, , e, mail]"
1,"A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.",4,comp.sys.mac.hardware,A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll Please send a brief message detailing your experiences with the procedure Top speed attained CPU rated speed add on cards and adapters heat sinks hour of usage per day floppy disk functionality with and m floppies are especially requested I will be summarizing in the next two days so please add to the network knowledge base if you have done the clock upgrade and haven t answered this poll Thanks,"[fair, number, brave, soul, upgrade, SI, clock, oscillator, share, experience, poll, , send, brief, message, detail, experience, procedure, , speed, attain, , cpu, rate, speed, , add, card, adapter, , heat, sink, , hour, usage, day, , floppy, disk, functionality, , , m, floppy, especially, request, , summarize, day, , add, network, knowledge, base, clock, upgrade, haven, t, answer, poll, , thank]"
2,"well folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i'm in the market for a\nnew machine a bit sooner than i intended to be...\n\ni'm looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected? i'd heard the 185c was supposed to make an\nappearence ""this summer"" but haven't heard anymore on it - and since i\ndon't have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo's just went through recently?\n\n* what's the impression of the display on the 180? i could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don't really have\na feel for how much ""better"" the display is (yea, it looks great in the\nstore, but is that all ""wow"" or is it really that good?). could i solicit\nsome opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk size and money hit to get the active display? (i realize\nthis is a real subjective question, but i've only played around with the\nmachines in a computer store breifly and figured the opinions of somebody\nwho actually uses the machine daily might prove helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - if you could email, i'll post a\nsummary (news reading time is at a premium with finals just around the\ncorner... :( )\n--\nTom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering",4,comp.sys.mac.hardware,well folks my mac plus finally gave up the ghost this weekend after starting life as a k way back in sooo i m in the market for a new machine a bit sooner than i intended to be i m looking into picking up a powerbook or maybe and have a bunch of questions that hopefully somebody can answer does anybody know any dirt on when the next round of powerbook introductions are expected i d heard the c was supposed to make an appearence this summer but haven t heard anymore on it and since i don t have access to macleak i was wondering if anybody out there had more info has anybody heard rumors about price drops to the powerbook line like the ones the duo s just went through recently what s the impression of the display on the i could probably swing a if i got the Mb disk rather than the but i don t really have a feel for how much better the display is yea it looks great in the store but is that all wow or is it really that good could i solicit some opinions of people who use the and day to day on if its worth taking the disk size and money hit to get the active display i realize this is a real subjective question but i ve only played around with the machines in a computer store breifly and figured the opinions of somebody who actually uses the machine daily might prove helpful how well does hellcats perform thanks a bunch in advance for any info if you could email i ll post a summary news reading time is at a premium with finals just around the corner Tom Willis twillis ecn purdue edu Purdue Electrical Engineering,"[folk, , mac, plus, finally, give, ghost, weekend, start, life, , k, way, , sooo, , m, market, new, machine, bit, sooner, intend, , m, look, pick, powerbook, , maybe, , bunch, question, , hopefully, , somebody, answer, , anybody, know, dirt, round, powerbook, introduction, expect, , d, hear, , c, suppose, appearence, , summer, , haven, t, hear, anymore, , don, t, access, macleak, , wonder, anybody, info, , anybody, hear, rumor, price, drop, powerbook, line, like, one, duo, s, go, recently, , s, impression, display, , probably, swing, , get, , Mb, disk, , don, t, feel, , ...]"
3,\nDo you have Weitek's address/phone number? I'd like to get some information\nabout this chip.\n,1,comp.graphics,Do you have Weitek s address phone number I d like to get some information about this chip,"[Weitek, s, address, phone, number, , d, like, information, chip]"
4,"From article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker):\n\n\nMy understanding is that the 'expected errors' are basically\nknown bugs in the warning system software - things are checked\nthat don't have the right values in yet because they aren't\nset till after launch, and suchlike. Rather than fix the code\nand possibly introduce new bugs, they just tell the crew\n'ok, if you see a warning no. 213 before liftoff, ignore it'.",14,sci.space,From article C owCB n p world std com by tombaker world std com Tom A Baker My understanding is that the expected errors are basically known bugs in the warning system software things are checked that don t have the right values in yet because they aren t set till after launch and suchlike Rather than fix the code and possibly introduce new bugs they just tell the crew ok if you see a warning no before liftoff ignore it,"[article, , C, owcb, n, p, world, std, com, , tombaker, world, std, com, , Tom, Baker, , understanding, , expect, error, , basically, know, bug, warning, system, software, , thing, check, don, t, right, value, aren, t, set, till, launch, , suchlike, , fix, code, possibly, introduce, new, bug, , tell, crew, , ok, , warning, , liftoff, , ignore, ]"


### The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

In [88]:
# Create Dictionary
id2word = corpora.Dictionary(df['lemmas'] )

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['lemmas']]

In [89]:
# How many words do we have?
len(id2word.keys())

77754

In [90]:
# Let's remove extreme values from the dataset
id2word.filter_extremes(no_below=5, no_above=0.75)

In [91]:
# How many words do we have?
len(id2word.keys())

14778

In [92]:
id2word[300]

'sheet'

In [93]:
df['content'][5]

'\n\n\n\n\nOf course.  The term must be rigidly defined in any bill.\n\n\nI doubt she uses this term for that.  You are using a quote allegedly\nfrom her, can you back it up?\n\n\n\n\nI read the article as presenting first an argument about weapons of mass\ndestruction (as commonly understood) and then switching to other topics.\nThe first point evidently was to show that not all weapons should be\nallowed, and then the later analysis was, given this understanding, to\nconsider another class.\n\n\n\n'

In [94]:
corpus[5]

[(0, 11),
 (117, 1),
 (177, 1),
 (193, 1),
 (221, 1),
 (225, 1),
 (226, 1),
 (227, 1),
 (228, 1),
 (229, 1),
 (230, 1),
 (231, 1),
 (232, 1),
 (233, 1),
 (234, 1),
 (235, 1),
 (236, 1),
 (237, 1),
 (238, 1),
 (239, 1),
 (240, 1),
 (241, 1),
 (242, 1),
 (243, 1),
 (244, 1),
 (245, 1),
 (246, 2),
 (247, 1),
 (248, 1),
 (249, 2)]

In [95]:
id2word[252]

'rm'

In [96]:
id2word[276]

'controller'

In [97]:
# Human readable format of corpus (term-frequency)
[(id2word[word_id], word_count) for word_id, word_count in corpus[5]]

[('  ', 11),
 ('helpful', 1),
 ('address', 1),
 ('ignore', 1),
 ('course', 1),
 ('evidently', 1),
 ('later', 1),
 ('mass', 1),
 ('point', 1),
 ('present', 1),
 ('quote', 1),
 ('read', 1),
 ('switch', 1),
 ('term', 1),
 ('topic', 1),
 ('understand', 1),
 ('weapon', 1),
 ('News', 1),
 ('Sean', 1),
 ('September', 1),
 ('Sharon', 1),
 ('accidentally', 1),
 ('bounce', 1),
 ('couldn', 1),
 ('delete', 1),
 ('directly', 1),
 ('file', 2),
 ('glad', 1),
 ('instead', 1),
 ('prob', 2)]

# Part 2: Estimate a LDA Model with Gensim

 ### Train an LDA model

In [98]:
%%time
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20, 
                                            chunksize=100,
                                            passes=10,
                                            per_word_topics=True)

# https://radimrehurek.com/gensim/models/ldamodel.html

IndexError: index 14778 is out of bounds for axis 1 with size 14778

In [99]:
# lda_model.save('lda_model.model')

In [None]:
%%time
lda_multicore = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=20, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)

# https://radimrehurek.com/gensim/models/ldamulticore.html

In [None]:
lda_multicore.save('lda_multicore.model')

In [None]:
from gensim import models
lda_multicore =  models.LdaModel.load('lda_multicore.model')

### View the topics in LDA model

In [None]:
newsgroups_train.target_names

In [None]:
pprint(lda_multicore.print_topics())
doc_lda = lda_multicore[corpus]

In [None]:
doc_lda

In [None]:
distro = [lda[d] for d in corpus]

### What is topic Perplexity?
Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of , you estimate the LDA model. Then given the theoretical word distributions represented by the topics, compare that to the actual topic mixtures, or distribution of words in your documents.

### What is topic coherence?
Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference.
A set of statements or facts is said to be coherent, if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”

In [None]:
# Compute Perplexity
print('\nPerplexity: ', lda_multicore.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore, 
                                     texts=df['lemmas'], 
                                     dictionary=id2word, 
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Part 3: Interpret LDA results & Select the appropriate number of topics

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore, corpus, id2word)
pyLDAvis.display(vis)

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
%%time
model_list, coherence_values = compute_coherence_values(dictionary=id2word, 
                                                        corpus=corpus, 
                                                        texts=df['lemmas'], 
                                                        start=2, 
                                                        limit=40, 
                                                        step=6)

In [None]:
coherence_values = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

In [None]:
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

In [None]:
# Select the model and print the topics
#optimal_model = model_list[4]
optimal_model =  models.LdaModel.load('optimal_model.model')
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))