# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach - learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but we know they're there by reading them.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
Primary use case: what the hell are your documents about? Who might want to know that in industry - 
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* Part 0: Warm-Up
* Part 1: Describe how an LDA Model works
* Part 2: Estimate a LDA Model with Gensim
* Part 3: Interpret LDA results & Select the appropriate number of topics

# Part 0: Warm-Up
How do we do a grid search? 

In [48]:
import warnings
warnings.filterwarnings("ignore")

  and should_run_async(code)


In [3]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
# Load training data
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'))

# Load testing data
newsgroups_test = fetch_20newsgroups(subset='test', 
                                     remove=('headers', 'footers', 'quotes'))

print(f'Training Samples: {len(newsgroups_train.data)}')
print(f'Testing Samples: {len(newsgroups_test.data)}')

Training Samples: 11314
Testing Samples: 7532


In [5]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
newsgroups_train['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [7]:
newsgroups_train['data'][1000]

"Anybody seen mouse cursor distortion running the Diamond 1024x768x256 driver?\nSorry, don't know the version of the driver (no indication in the menus) but it's a recently\ndelivered Gateway system.  Am going to try the latest drivers from Diamond BBS but wondered\nif anyone else had seen this.\n\npost or email"

### GridSearch on Just Classifier
* Fit the vectorizer and prepare BEFORE it goes into the gridsearch

In [9]:
# Instantiate vectorizer
vect = TfidfVectorizer()

# Transform the training data
X_train = vect.fit_transform(newsgroups_train['data'])
print(X_train.shape)

(11314, 101631)


In [10]:
params_1 = {
    'min_samples_leaf': [1, 2, 5, 10]
}

# Instantiate classifier
clf = RandomForestClassifier()

# GridSearch
gs1 = GridSearchCV(clf, params_1, cv=5, n_jobs=-1, verbose=1)
gs1.fit(X_train, newsgroups_train['target'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:  1.2min remaining:    7.9s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.2min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [11]:
gs1.best_score_

0.6588297577646474

In [12]:
gs1.best_params_

{'min_samples_leaf': 2}

In [13]:
test_sample = vect.transform(["The new york yankees are the best team in the region."])
test_sample.shape

(1, 101631)

In [14]:
gs1.predict(test_sample)[0]

9

In [15]:
newsgroups_train['target_names'][9]

'rec.sport.baseball'

### GridSearch with BOTH the Vectoizer & Classifier

In [16]:
from sklearn.pipeline import Pipeline

# 1. Create a pipeline with a vectorize and a classifier
# 2. Use Grid Search to optimize the entire pipeline
pipe = Pipeline([
    ('vect',TfidfVectorizer()),
    ('clf',RandomForestClassifier(random_state=42))
])

params_2 = {
    'vect__stop_words': (None,'english'),
    'vect__min_df': (2,5),
    'clf__max_depth': (10, None)
}

gs2 = GridSearchCV(pipe, params_2, cv=5, n_jobs=-1, verbose=1)
gs2.fit(newsgroups_train['data'], newsgroups_train['target'])

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.2min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [17]:
gs2.best_score_

0.6607746264533867

In [18]:
gs2.best_params_

{'clf__max_depth': None, 'vect__min_df': 2, 'vect__stop_words': 'english'}

In [19]:
pred = gs2.predict(["The new york yankees are the best team in the region."])
pred

array([9])

In [20]:
newsgroups_train['target_names'][pred[0]]

'rec.sport.baseball'

Advantages to using GS with the Pipe:
* Allows us to make predictions on raw text increasing reproducibility. :)
* Allows us to tune the parameters of the vectorizer along side the classifier. :D 

# Part 1: Describe how an LDA Model works

[Your Guide to Latent Dirichlet Allocation](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d)

[LDA Topic Modeling](https://lettier.com/projects/lda-topic-modeling/)

[Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)

In [21]:
# Download spacy model
import spacy.cli
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [22]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

In [23]:
df = pd.DataFrame({
    'content': newsgroups_train['data'],
    'target': newsgroups_train['target'],
    'target_names': [newsgroups_train['target_names'][i] for i in newsgroups_train['target']]
})
print(df.shape)

(11314, 3)


  and should_run_async(code)


In [24]:
pd.set_option('display.max_colwidth', 0)
df.sample(3)

  and should_run_async(code)


Unnamed: 0,content,target,target_names
10820,bm967@cleveland.Freenet.Edu (David Kantrowitz) writes ...\n\nSure. Buy a switch box and a multisync monitor. I have just that\narrangement on my desk and it works fine.\n,4,comp.sys.mac.hardware
6173,"THE DIVINE MASTERS \n \n Most Christians would agree, and correctly so, that \n Jesus Christ was a Divine Master, and a projection of God \n into the physical world, God Incarnate. \n \n But there are some very important related facts that \n Christians are COMPLETELY IGNORANT of, as are followers of \n most other world religions. \n \n First, Jesus Christ was NOT unique, John 3:16 NOTWITH-\n STANDING. There is ALWAYS at least one such Divine Master \n (God Incarnate) PHYSICALLY ALIVE in this world AT ALL TIMES, \n a continuous succession THROUGHOUT HISTORY, both before and \n after the life of Jesus. \n \n The followers of some of these Masters founded the \n world's major religions, usually PERVERTING the teachings of \n their Master in the process. Christians, for example, added \n THREATS of ""ETERNAL DAMNATION"" in Hell, and DELETED the \n teaching of REincarnation. \n \n Secondly, and more importantly, after a particular \n Master physically dies and leaves this world, there is \n NOTHING that He can do for ANYbody except for the relatively \n few people that He INITIATED while He was still PHYSICALLY \n alive. (THAT IS SIMPLY THE WAY GOD SET THINGS UP IN THE \n UNIVERSES.) \n\n Therefore, all those Christians who worship Jesus, and \n pray to Jesus, and expect Jesus to return and save them from \n their sins, are only KIDDING THEMSELVES, and have allowed \n themselves to be DUPED by a religion that was mostly \n MANUFACTURED by the Romans. \n \n And emotional ""feelings"" are a TOTALLY DECEIVING \n indicator for religious validity. \n \n These things are similarly true for followers of most \n other major world religions, including Islam. \n \n Thirdly, the primary function of each Master is to tune \n His Initiates into the ""AUDIBLE LIFE STREAM"" or ""SOUND \n CURRENT"", (referred to as ""THE WORD"" in John 1:1-5, and as \n ""The River of Life"" in Revelation 22:1), and to personally \n guide each of them thru the upper levels of Heaven while they \n are still connected to their living physical bodies by a \n ""silver cord"". \n \n True Salvation, which completes a Soul's cycles of \n REincarnation in the physical and psychic planes, is achieved \n only by reaching at least the ""SOUL PLANE"", which is five \n levels or universes above the physical universe, and this \n canNOT be done without the help of a PHYSICALLY-Living Divine \n Master. \n \n One such Divine Master alive today is an American, Sri \n Harold Klemp, the Living ""Eck"" Master or ""Mahanta"" for the \n ""Eckankar"" organization, now headquartered in Minneapolis, \n (P.O. Box 27300; zip 55427). \n \n Another Divine Master is Maharaj Gurinder Singh Ji, now \n living in Punjab, India, and is associated with the ""Sant \n Mat"" organization. \n \n One of the classic books on this subject is ""THE PATH OF \n THE MASTERS"" (Radha Soami Books, P.O. Box 242, Gardena, CA \n 90247), written in 1939 by Dr. Julian Johnson, a theologian \n and surgeon who spent the last years of his life in India \n studying under and closely observing the Sant Mat Master of \n that time, Maharaj Sawan Singh Ji. \n \n Several of the Eckankar books, including some authored \n by Sri Paul Twitchell or Sri Harold Klemp, can be found in \n most public and university libraries and some book stores, or \n obtained thru inter-library loan. The book ""ECKANKAR--THE \n KEY TO SECRET WORLDS"", by Sri Paul Twitchell, is ANOTHER \n classic. \n \n Many Christians are likely to confuse the Masters with \n the ""Anti-Christ"", which is or was to be a temporary world \n dictator during the so-called ""last days"". But the Masters \n don't ever rule, even when asked or expected to do so as \n Jesus was. \n \n People who continue following Christianity, Islam, or \n other orthodox religions with a physically-DEAD Master, will \n CONTINUE on their cycles of REincarnation, between the \n Psychic Planes and this MISERABLE physical world, until they \n finally accept Initiation from a PHYSICALLY-LIVING Divine \n Master. \n \n \n \n RE-INCARNATION\n \n The book ""HERE AND HEREAFTER"", by Ruth Montgomery, \n describes several kinds of evidence supporting REincarnation \n as a FACT OF LIFE, including HYPNOTIC REGRESSIONS to past \n lives [about 50% accurate; the subconscious mind sometimes \n makes things up, especially with a bad hypnotist], \n SPONTANEOUS RECALL (especially by young children, some of \n whom can identify their most recent previous relatives, \n homes, possessions, etc.), DREAM RECALL of past life experi-\n ences, DEJA VU (familiarity with a far off land while travel-\n ing there for the first time on vacation), the psychic read-\n ings of the late EDGAR CAYCE, and EVEN SUPPORTING STATEMENTS \n FROM THE CHRISTIAN BIBLE including Matthew 17:11-13 (John the \n Baptist was the REINCARNATION of Elias.) and John 9:1-2 (How \n can a person POSSIBLY sin before he is born, unless he LIVED \n BEFORE?!). [ ALWAYS use the ""KING JAMES VERSION"". Later \n versions are PER-VERSIONS! ] \n \n Strong INTERESTS, innate TALENTS, strong PHOBIAS, etc., \n typically originate from a person's PAST LIVES. For example, \n a strong fear of swimming in or traveling over water usually \n results from having DROWNED at the end of a PREVIOUS LIFE. \n And sometimes a person will take AN IMMEDIATE DISLIKE to \n another person being met for the first time in THIS life, \n because of a bad encounter with him during a PREVIOUS \n INCARNATION. \n\n The teaching of REincarnation also includes the LAW OF \n KARMA (Galatians 6:7, Revelation 13:10, etc.). People would \n behave much better toward each other if they knew that their \n actions in the present will surely be reaped by them in the \n future, or in a FUTURE INCARNATION! \n\n\n\n ""2nd COMINGS""\n\n If a Divine Master physically dies (""translates"") \n before a particular Initiate of His does, then when that \n Initiate physically dies (""translates""), the Master will meet \n him on the Astral level and take him directly to the Soul \n Plane. This is the ONE AND ONLY correct meaning of a 2nd \n Coming. It is an INDIVIDUAL experience, NOT something that \n happens for everyone all at once. People who are still \n waiting for Jesus' ""2nd Coming"" are WAITING IN VAIN. \n \n \n \n PLANES OF EXISTENCE\n\n The physical universe is the LOWEST of at least a DOZEN \n major levels of existence. Above the Physical Plane is the \n Astral Plane, the Causal Plane, the Mental Plane, the Etheric \n Plane (often counted as the upper part of the Mental Plane), \n the Soul Plane, and several higher Spiritual Planes. The \n Soul Plane is the FIRST TRUE HEAVEN, (counting upward from \n the Physical). The planes between (but NOT including) the \n Physical and Soul Planes are called the Psychic Planes. \n \n It is likely that ESP, telepathy, astrological \n influences, radionic effects, biological transmutations [See \n the 1972 book with that title.], and other phenomena without \n an apparent physical origin, result from INTERACTIONS between \n the Psychic Planes and the Physical Plane. \n \n The major planes are also SUB-DIVIDED. For example, a \n sub-plane of the Astral Plane is called ""Hades"", and the \n Christian Hell occupies a SMALL part of it, created there \n LESS THAN 2000 YEARS AGO by the EARLY CATHOLIC CHURCH by some \n kind of black magic or by simply teaching its existence in a \n THREATENING manner. The Christian ""Heaven"" is located \n elsewhere on the Astral Plane. Good Christians will go there \n for a short while and then REincarnate back to Earth. \n \n \n \n SOUND CURRENT vs. BLIND FAITH\n\n The Christian religion demands of its followers an \n extraordinary amount of BLIND FAITH backed up by little more \n than GOOD FEELING (which is TOTALLY DECEIVING). \n \n If a person is not HEARING some form of the ""SOUND \n CURRENT"" (""THE WORD"", ""THE BANI"", ""THE AUDIBLE LIFE STREAM""), \n then his cycles of REINCARNATION in this MISERABLE world WILL \n CONTINUE. \n \n The ""SOUND CURRENT"" manifests differently for different \n Initiates, and can sound like a rushing wind, ocean waves on \n the sea shore, buzzing bees, higher-pitched buzzing sound, a \n flute, various heavenly music, or other sounds. In Eckankar, \n Members start hearing it near the end of their first year as \n a Member. This and other experiences (such as ""SOUL TRAVEL"") \n REPLACE blind faith. \n \n\n\n For more information, answers to your questions, etc., \n please consult my CITED SOURCES (3 books, 2 addresses). \n\n\n\n UN-altered REPRODUCTION and DISSEMINATION of this \n IMPORTANT Information is ENCOURAGED. \n",19,talk.religion.misc
4606,"Let me try sending this message again, I botched up the margins the\nfirst time; *sorry* 'bout that :)\n\nDoes anyone out there know of any products using Motorola's Neuron(r)\nchips MC143150 or MC143120. If so, what are they and are they utilizing\nStandard Network Variable Types (SNVT)?\n________________________________________________________________________",12,sci.electronics


In [35]:
# For reference on regex: https://docs.python.org/3/library/re.html

# From 'content' column: 

# 1. Remove new line characters
df['content'] = df['content'].apply(lambda x: re.sub('\s+',' ', x))

# 2. Remove extra whitespace 
df['content'] = df['content'].apply(lambda x: ' '.join(x.split()))
# 3. Remove Emails
df['content'] = df['content'].apply(lambda x: re.sub('From: \S+@\S+','', x))

# 4. Remove non-alphanumeric characters
df['content'] = df['content'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))


  and should_run_async(code)
  df['content'] = df['content'].apply(lambda x: re.sub('\s+',' ', x))
  df['content'] = df['content'].apply(lambda x: re.sub('From: \S+@\S+','', x))


In [36]:
df.sample(3)

  and should_run_async(code)


Unnamed: 0,content,target,target_names
626,Could somebody explain to me what a centrifuge is and what it is used for I vaguely remembre it being something that spins test tubes around really fast but I cant remember why youd want to do that Purely recreational They get bored sitting in that rack all the time,13,sci.med
3393,How can you assume it was a sarcastic remark,10,rec.sport.hockey
5627,Hello I m trying to get X R running on my PC and ran into the following error message when trying to start the Xserver Setting TCP SO DONTLINGER Option not supported by protocol X Version X Windows System protocol Version revision vendor release Fatal server error no screens found giving up xinit software cased connection abort errno unable to connect to X xserver does anyone know what this error means has anyone experienced this problem help will be much appreciated thanks in advance please send replies to christy alex qc ca,5,comp.windows.x


In [37]:
nlp = spacy.load("en_core_web_lg")

  and should_run_async(code)


In [38]:
# Leverage tqdm for progress_apply
from tqdm import tqdm
tqdm.pandas()

# If you're on macOS, Linux, or python session executed from Windows Subsystem for Linux (WSL)
# conda activate U4-S1-NLP
# pip install pandarallel
#
# from pandarallel import pandarallel
# pandarallel.initialize(progress_bar=True)
#
# df['lemmas'] = df['content'].parallel_apply(get_lemmas)
#
# Ref: https://github.com/nalepae/pandarallel

  and should_run_async(code)


In [39]:
# Create 'lemmas' column
def get_lemmas(x):
    lemmas = []
    for token in nlp(x):
        if (token.is_stop!=True) and (token.is_punct!=True):
            lemmas.append(token.lemma_)
    return lemmas

df['lemmas'] = df['content'].progress_apply(get_lemmas)

  and should_run_async(code)
100%|████████████████████████████████████████████████████████████████████████████| 11314/11314 [04:58<00:00, 37.95it/s]


In [40]:
df.head()

  and should_run_async(code)


Unnamed: 0,content,target,target_names,lemmas
0,I was wondering if anyone out there could enlighten me on this car I saw the other day It was a door sports car looked to be from the late s early s It was called a Bricklin The doors were really small In addition the front bumper was separate from the rest of the body This is all I know If anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail,7,rec.autos,"[wonder, enlighten, car, see, day, , , door, sport, car, , look, late, , s, , early, , s, , call, Bricklin, , door, small, , addition, , bumper, separate, rest, body, , know, , tellme, model, , engine, spec, , year, production, , car, , history, , info, funky, look, car, , e, mail]"
1,A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll Please send a brief message detailing your experiences with the procedure Top speed attained CPU rated speed add on cards and adapters heat sinks hour of usage per day floppy disk functionality with and m floppies are especially requested I will be summarizing in the next two days so please add to the network knowledge base if you have done the clock upgrade and haven t answered this poll Thanks,4,comp.sys.mac.hardware,"[fair, number, brave, soul, upgrade, SI, clock, oscillator, share, experience, poll, , send, brief, message, detail, experience, procedure, , speed, attain, , cpu, rate, speed, , add, card, adapter, , heat, sink, , hour, usage, day, , floppy, disk, functionality, , , m, floppy, especially, request, , summarize, day, , add, network, knowledge, base, clock, upgrade, haven, t, answer, poll, , thank]"
2,well folks my mac plus finally gave up the ghost this weekend after starting life as a k way back in sooo i m in the market for a new machine a bit sooner than i intended to be i m looking into picking up a powerbook or maybe and have a bunch of questions that hopefully somebody can answer does anybody know any dirt on when the next round of powerbook introductions are expected i d heard the c was supposed to make an appearence this summer but haven t heard anymore on it and since i don t have access to macleak i was wondering if anybody out there had more info has anybody heard rumors about price drops to the powerbook line like the ones the duo s just went through recently what s the impression of the display on the i could probably swing a if i got the Mb disk rather than the but i don t really have a feel for how much better the display is yea it looks great in the store but is that all wow or is it really that good could i solicit some opinions of people who use the and day to day on if its worth taking the disk size and money hit to get the active display i realize this is a real subjective question but i ve only played around with the machines in a computer store breifly and figured the opinions of somebody who actually uses the machine daily might prove helpful how well does hellcats perform thanks a bunch in advance for any info if you could email i ll post a summary news reading time is at a premium with finals just around the corner Tom Willis twillis ecn purdue edu Purdue Electrical Engineering,4,comp.sys.mac.hardware,"[folk, , mac, plus, finally, give, ghost, weekend, start, life, , k, way, , sooo, , m, market, new, machine, bit, sooner, intend, , m, look, pick, powerbook, , maybe, , bunch, question, , hopefully, , somebody, answer, , anybody, know, dirt, round, powerbook, introduction, expect, , d, hear, , c, suppose, appearence, , summer, , haven, t, hear, anymore, , don, t, access, macleak, , wonder, anybody, info, , anybody, hear, rumor, price, drop, powerbook, line, like, one, duo, s, go, recently, , s, impression, display, , probably, swing, , get, , Mb, disk, , don, t, feel, , ...]"
3,Do you have Weitek s address phone number I d like to get some information about this chip,1,comp.graphics,"[Weitek, s, address, phone, number, , d, like, information, chip]"
4,From article C owCB n p world std com by tombaker world std com Tom A Baker My understanding is that the expected errors are basically known bugs in the warning system software things are checked that don t have the right values in yet because they aren t set till after launch and suchlike Rather than fix the code and possibly introduce new bugs they just tell the crew ok if you see a warning no before liftoff ignore it,14,sci.space,"[article, , C, owcb, n, p, world, std, com, , tombaker, world, std, com, , Tom, Baker, , understanding, , expect, error, , basically, know, bug, warning, system, software, , thing, check, don, t, right, value, aren, t, set, till, launch, , suchlike, , fix, code, possibly, introduce, new, bug, , tell, crew, , ok, , warning, , liftoff, , ignore, ]"


### The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

In [51]:
# Create Dictionary
id2word = corpora.Dictionary(df['lemmas'] )

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['lemmas']]

In [52]:
# How many words do we have?
len(id2word.keys())

77754

In [53]:
# Let's remove extreme values from the dataset
id2word.filter_extremes(no_below=5, no_above=0.75)

In [54]:
# How many words do we have?
len(id2word.keys())

14778

In [55]:
id2word[300]

'sheet'

In [56]:
df['content'][5]

'Of course  The term must be rigidly defined in any bill  I doubt she uses this term for that  You are using a quote allegedly from her  can you back it up  I read the article as presenting first an argument about weapons of mass destruction  as commonly understood  and then switching to other topics  The first point evidently was to show that not all weapons should be allowed  and then the later analysis was  given this understanding  to consider another class '

In [57]:
corpus[5]

[(0, 11),
 (117, 1),
 (177, 1),
 (193, 1),
 (221, 1),
 (225, 1),
 (226, 1),
 (227, 1),
 (228, 1),
 (229, 1),
 (230, 1),
 (231, 1),
 (232, 1),
 (233, 1),
 (234, 1),
 (235, 1),
 (236, 1),
 (237, 1),
 (238, 1),
 (239, 1),
 (240, 1),
 (241, 1),
 (242, 1),
 (243, 1),
 (244, 1),
 (245, 1),
 (246, 2),
 (247, 1),
 (248, 1),
 (249, 2)]

In [49]:
id2word[252]

'rm'

In [58]:
id2word[276]

'controller'

In [None]:
# Human readable format of corpus (term-frequency)
[(id2word[word_id], word_count) for word_id, word_count in corpus[5]]

# Part 2: Estimate a LDA Model with Gensim

 ### Train an LDA model

In [None]:
%%time
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20, 
                                            chunksize=100,
                                            passes=10,
                                            per_word_topics=True)

# https://radimrehurek.com/gensim/models/ldamodel.html

In [None]:
# lda_model.save('lda_model.model')

In [None]:
%%time
lda_multicore = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=20, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)

# https://radimrehurek.com/gensim/models/ldamulticore.html

In [None]:
lda_multicore.save('lda_multicore.model')

In [None]:
from gensim import models
lda_multicore =  models.LdaModel.load('lda_multicore.model')

### View the topics in LDA model

In [None]:
newsgroups_train.target_names

In [None]:
pprint(lda_multicore.print_topics())
doc_lda = lda_multicore[corpus]

In [None]:
doc_lda

In [None]:
distro = [lda[d] for d in corpus]

### What is topic Perplexity?
Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of , you estimate the LDA model. Then given the theoretical word distributions represented by the topics, compare that to the actual topic mixtures, or distribution of words in your documents.

### What is topic coherence?
Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference.
A set of statements or facts is said to be coherent, if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”

In [None]:
# Compute Perplexity
print('\nPerplexity: ', lda_multicore.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore, 
                                     texts=df['lemmas'], 
                                     dictionary=id2word, 
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Part 3: Interpret LDA results & Select the appropriate number of topics

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore, corpus, id2word)
pyLDAvis.display(vis)

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
%%time
model_list, coherence_values = compute_coherence_values(dictionary=id2word, 
                                                        corpus=corpus, 
                                                        texts=df['lemmas'], 
                                                        start=2, 
                                                        limit=40, 
                                                        step=6)

In [None]:
coherence_values = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

In [None]:
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

In [None]:
# Select the model and print the topics
#optimal_model = model_list[4]
optimal_model =  models.LdaModel.load('optimal_model.model')
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))