In [1]:
%matplotlib inline


LDA Model
=========

Introduces Gensim's LDA model and demonstrates its use on a corpus of resumes.


In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In this tutorial we will:

* Load data.
* Pre-process data.
* Transform documents to a vectorized form.
* Train an LDA model.

If you are not familiar with the LDA model or how to use it in Gensim, I
suggest you read up on that before continuing with this tutorial. A basic
understanding of the LDA model should suffice. Examples:

* `Introduction to Latent Dirichlet Allocation <http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation>`_
* Gensim tutorial: `sphx_glr_auto_examples_core_run_topics_and_transformations.py`
* Gensim's LDA model API docs: :py:class:`gensim.models.LdaModel`


Data: 1219 resumes

.. Important::
    The corpus contains 1219 documents, and not particularly long ones.
    So keep in mind that this tutorial is not geared towards efficiency, and be
    careful before applying the code to a large dataset.




## We'll try with cleaned job data first

In [93]:
import io
import os.path
import re
import tarfile

import pandas as pd

import smart_open



clean_jobs_path = '/Users/richardkuzma/coding/NLP_projects/job_recommender_project/data/cleaned_job_posts_madhab.csv'
clean_resumes_path = '/Users/richardkuzma/coding/NLP_projects/job_recommender_project/data/cleaned_resume_dataset_maitrip.csv'

resumes = pd.read_csv(clean_resumes_path)

resumes.head()


# def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
#     fname = url.split('/')[-1]
    
#     # Download the file to local storage first.
#     # We can't read it on the fly because of 
#     # https://github.com/RaRe-Technologies/smart_open/issues/331
#     if not os.path.isfile(fname):
#         with smart_open.open(url, "rb") as fin:
#             with smart_open.open(fname, 'wb') as fout:
#                 while True:
#                     buf = fin.read(io.DEFAULT_BUFFER_SIZE)
#                     if not buf:
#                         break
#                     fout.write(buf)
                         
#     with tarfile.open(fname, mode='r:gz') as tar:
#         # Ignore directory entries, as well as files like README, etc.
#         files = [
#             m for m in tar.getmembers()
#             if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
#         ]
#         for member in sorted(files, key=lambda x: x.name):
#             member_bytes = tar.extractfile(member).read()
#             yield member_bytes.decode('utf-8', errors='replace')

# docs = list(extract_documents())

Unnamed: 0,ID,Category,dirty_resume,resume
0,1,HR,"b'John H. Smith, P.H.R.\n800-991-5187 | PO Box...",john h smith phr po box callahan fl infog...
1,2,HR,b'Name Surname\nAddress\nMobile No/Email\nPERS...,name surname address mobile noemail personal p...
2,3,HR,b'Anthony Brown\nHR Assistant\nAREAS OF EXPERT...,anthony brown hr assistant areas expertise per...
3,4,HR,b'www.downloadmela.com\nSatheesh\nEMAIL ID:\nC...,satheesh email id career objective pursue gro...
4,5,HR,"b""HUMAN RESOURCES DIRECTOR\n\xef\x82\xb7Expert...",human resources director expert organizational...


So we have 1219 documents, where each document is a Unicode string stored in
the ``resume`` column of the DataFrame. If you're thinking about using your own
corpus, then you need to make sure that it's in the same format (one Unicode
string per document) before proceeding with the rest of this tutorial.




In [94]:
print(len(resumes))
print(resumes['resume'][0][:500])

1219
john h smith phr    po box  callahan fl  infogreatresumesfastcom approachable innovator passion human resources senior human resources professional personable analytical flexible senior hr professional multifaceted expertise seasoned benefits administrator extensive experience working highly paid professionals client relationship based settings dynamic team leader capable analyzing alternatives identifying tough choices communicating total value benefit compensation packages senior level executi


Pre-process and vectorize the documents
---------------------------------------

As part of preprocessing, we will:

* Tokenize (split the documents into tokens).
* Lemmatize the tokens.
* Compute bigrams.
* Compute a bag-of-words representation of the data.

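First we strip digits and leftover encoding debris from the text, then
tokenize it with NLTK's ``word_tokenize``. Numeric tokens don't tend to be
useful, and the dataset contains a lot of them. Single-character tokens are
not removed at this stage, which is why tokens such as "j" and "r" still
appear in the topics further down.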

.. Important::

   This tutorial uses the nltk library for preprocessing, although you can
   replace it with something else if you want.
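
As a rough illustration of the tokenization idea, and of swapping NLTK for something else, here is a minimal sketch that uses only Python's standard ``re`` module. The ``tokenize`` helper and the sample string are made up for this example and are not part of the notebook; the helper lowercases the text, splits on word characters, and drops numeric and single-character tokens.

```python
import re

# Matches runs of word characters (letters, digits, underscore).
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase tokens, dropping numbers and 1-char tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if not t.isnumeric() and len(t) > 1]


# Hypothetical sample, loosely shaped like a cleaned resume line.
sample = "john h smith phr 800 po box 12 callahan fl"
tokens = tokenize(sample)
print(tokens)
```

With a pandas DataFrame, the same helper could be applied per row, for example with ``resumes['resume'].apply(tokenize)``.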




In [95]:
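# Tokenize the documents.
import nltk

# Remove numbers. str.replace returns a new Series, so assign the result back.
resumes['resume'] = resumes['resume'].astype(str).str.replace(r'\d+', '', regex=True)

# Remove encoding debris that the cleaning_data notebook left behind.
resumes['resume'] = resumes['resume'].str.replace('xefxxb', ' ', regex=False)
resumes['resume'] = resumes['resume'].str.replace('xexxa', ' ', regex=False)

# word_tokenize needs NLTK's "punkt" tokenizer data to be installed.
resumes['tokenized_resume'] = resumes['resume'].apply(nltk.word_tokenize)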


In [96]:
resumes.head()

Unnamed: 0,ID,Category,dirty_resume,resume,tokenized_resume
0,1,HR,"b'John H. Smith, P.H.R.\n800-991-5187 | PO Box...",john h smith phr po box callahan fl infog...,"[john, h, smith, phr, po, box, callahan, fl, i..."
1,2,HR,b'Name Surname\nAddress\nMobile No/Email\nPERS...,name surname address mobile noemail personal p...,"[name, surname, address, mobile, noemail, pers..."
2,3,HR,b'Anthony Brown\nHR Assistant\nAREAS OF EXPERT...,anthony brown hr assistant areas expertise per...,"[anthony, brown, hr, assistant, areas, experti..."
3,4,HR,b'www.downloadmela.com\nSatheesh\nEMAIL ID:\nC...,satheesh email id career objective pursue gro...,"[satheesh, email, id, career, objective, pursu..."
4,5,HR,"b""HUMAN RESOURCES DIRECTOR\n\xef\x82\xb7Expert...",human resources director expert organizational...,"[human, resources, director, expert, organizat..."


In [97]:
for count, token in enumerate(resumes['tokenized_resume'][0]):
    print(token, " ", count)

john   0
h   1
smith   2
phr   3
po   4
box   5
callahan   6
fl   7
infogreatresumesfastcom   8
approachable   9
innovator   10
passion   11
human   12
resources   13
senior   14
human   15
resources   16
professional   17
personable   18
analytical   19
flexible   20
senior   21
hr   22
professional   23
multifaceted   24
expertise   25
seasoned   26
benefits   27
administrator   28
extensive   29
experience   30
working   31
highly   32
paid   33
professionals   34
client   35
relationship   36
based   37
settings   38
dynamic   39
team   40
leader   41
capable   42
analyzing   43
alternatives   44
identifying   45
tough   46
choices   47
communicating   48
total   49
value   50
benefit   51
compensation   52
packages   53
senior   54
level   55
executives   56
employees   57
core   58
competencies   59
benefits   60
administration   61
customer   62
service   63
cost   64
control   65
recruiting   66
acquisition   67
management   68
compliance   69
reporting   70
retention   71
prof

We use the WordNet lemmatizer from NLTK. A lemmatizer is preferred over a
stemmer in this case because it produces more readable words. Output that is
easy to read is very desirable in topic modelling.




In [98]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_text(pandas_series):
    return [lemmatizer.lemmatize(token) for token in pandas_series]

resumes['lemmatized_resume'] = resumes['tokenized_resume'].apply(lemmatize_text)

In [99]:
resumes.head()

Unnamed: 0,ID,Category,dirty_resume,resume,tokenized_resume,lemmatized_resume
0,1,HR,"b'John H. Smith, P.H.R.\n800-991-5187 | PO Box...",john h smith phr po box callahan fl infog...,"[john, h, smith, phr, po, box, callahan, fl, i...","[john, h, smith, phr, po, box, callahan, fl, i..."
1,2,HR,b'Name Surname\nAddress\nMobile No/Email\nPERS...,name surname address mobile noemail personal p...,"[name, surname, address, mobile, noemail, pers...","[name, surname, address, mobile, noemail, pers..."
2,3,HR,b'Anthony Brown\nHR Assistant\nAREAS OF EXPERT...,anthony brown hr assistant areas expertise per...,"[anthony, brown, hr, assistant, areas, experti...","[anthony, brown, hr, assistant, area, expertis..."
3,4,HR,b'www.downloadmela.com\nSatheesh\nEMAIL ID:\nC...,satheesh email id career objective pursue gro...,"[satheesh, email, id, career, objective, pursu...","[satheesh, email, id, career, objective, pursu..."
4,5,HR,"b""HUMAN RESOURCES DIRECTOR\n\xef\x82\xb7Expert...",human resources director expert organizational...,"[human, resources, director, expert, organizat...","[human, resource, director, expert, organizati..."


We find bigrams in the documents. Bigrams are sets of two adjacent words.
Using bigrams we can get phrases like "machine_learning" in our output
(spaces are replaced with underscores); without bigrams we would only get
"machine" and "learning".

Note that in the code below, we find bigrams and then add them to the
original data, because we would like to keep the words "machine" and
"learning" as well as the bigram "machine_learning".

.. Important::
    Computing n-grams of a large dataset can be very computationally
    and memory intensive.
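
To make the "find bigrams, then append them to the original tokens" idea concrete, here is a simplified sketch in plain Python. It is not how Gensim's ``Phrases`` works internally: ``Phrases`` also scores each pair against a threshold, whereas this sketch only counts adjacent pairs. The ``add_frequent_bigrams`` helper and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def add_frequent_bigrams(docs, min_count=2):
    """Append 'a_b' tokens for adjacent pairs seen at least min_count times.

    Simplification: only raw pair counts are used, with no scoring threshold.
    """
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}

    augmented = []
    for doc in docs:
        extra = ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
        # Keep the original unigrams and add the bigram tokens at the end.
        augmented.append(doc + extra)
    return augmented


toy_docs = [
    ["machine", "learning", "engineer"],
    ["machine", "learning", "research"],
    ["human", "resource", "manager"],
]
result = add_frequent_bigrams(toy_docs, min_count=2)
print(result[0])
```

The unigrams "machine" and "learning" stay in the document, and "machine_learning" is added alongside them.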




In [100]:
docs = resumes['lemmatized_resume'].values.tolist()
docs[2][:10]

['anthony',
 'brown',
 'hr',
 'assistant',
 'area',
 'expertise',
 'personal',
 'summary',
 'hr',
 'process']

In [101]:
resumes['resume'][2]




'anthony brown hr assistant areas expertise personal summary hr processes systems competent organised individual able work part team manage several priorities one time anthony positive attitude strong work ethic keen desire learn grow within firm possesses superb communications skills always treats people respect according individual needs dedicated professional fully understands importance hr department organisation therefore aims make office works effective efficient possible extensive experience working commercially focussed organisations fully understands pressures achieving targets accurately assessing job applicants according ability contract document generation accepting resignations business administration note taking right would like work friendly exciting company looking hr assistant reflect values excellence quality recruitment methodologies career history employment legislation answering queries document management equal opportunities absence management calendar management 

In [102]:
# docs2 = docs.copy()
# requires_removal = ["xexxa", "xefxxa", "xefxxb", "xefxx", "xx", "xcx", "x", "xcxa"]

# # for i in docs2:
# #     i = [ token for token in i if token not in requires_removal ]

# for i in docs2:
#     for lem_token in i:
#         if lem_token in requires_removal:
#             i.remove(lem_token)



In [107]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

2020-04-15 08:08:06,151 : INFO : collecting all words and their counts
2020-04-15 08:08:06,170 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-04-15 08:08:08,109 : INFO : collected 424500 word types from a corpus of 763152 words (unigram + bigrams) and 1219 sentences
2020-04-15 08:08:08,110 : INFO : using 424500 counts as vocab in Phrases<0 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>


We remove rare words and common words based on their *document frequency*.
Below we remove words that appear in fewer than 20 documents or in more than
50% of the documents. Consider also trying to remove words based only on their
overall frequency in the corpus, or combining that with this approach.
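
The document-frequency rule can be sketched in a few lines of plain Python. This is an illustrative approximation of what ``filter_extremes`` does with ``no_below`` and ``no_above``, not Gensim's implementation; the ``tokens_to_keep`` helper and the toy documents are hypothetical.

```python
from collections import Counter


def tokens_to_keep(docs, no_below=2, no_above=0.5):
    """Return tokens whose document frequency lies within the given bounds."""
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document, however often it repeats.
        doc_freq.update(set(doc))
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["experience", "python", "callahan"],
    ["experience", "python"],
    ["experience", "sales"],
    ["experience", "hr"],
]
kept = tokens_to_keep(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

Here "experience" is dropped for appearing in every document, while "callahan", "sales" and "hr" are dropped for appearing in only one.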




In [108]:
#doing this on the lemmatized resume corpus discards 39k tokens and keeps 3800 tokens...



# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

2020-04-15 08:08:30,459 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-15 08:08:31,833 : INFO : built Dictionary(42400 unique tokens: ['a', 'aap', 'abc', 'account', 'achieve']...) from 1219 documents (total 820614 corpus positions)
2020-04-15 08:08:31,996 : INFO : discarding 38581 tokens: [('aap', 3), ('acme', 13), ('adaptability', 8), ('admin', 19), ('adp', 16), ('affirmative', 4), ('approachable', 6), ('callahan', 15), ('cebs', 2), ('cognos', 10)]...
2020-04-15 08:08:31,997 : INFO : keeping 3819 tokens which were in no less than 20 and no more than 609 (=50.0%) documents
2020-04-15 08:08:32,021 : INFO : resulting dictionary: Dictionary(3819 unique tokens: ['a', 'abc', 'account', 'achieve', 'acquired']...)


Finally, we transform the documents to a vectorized form. We simply compute
the frequency of each word, including the bigrams.
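
For intuition, a bag-of-words vector is just a list of (token id, count) pairs. The sketch below mimics that shape in plain Python; the ``doc_to_bow`` helper and the tiny ``token2id`` mapping are made up for illustration and stand in for ``dictionary.doc2bow`` and the real dictionary.

```python
from collections import Counter


def doc_to_bow(doc, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Hypothetical vocabulary; tokens outside it are ignored.
token2id = {"hr": 0, "manager": 1, "python": 2}
bow = doc_to_bow(["hr", "manager", "hr", "unknown"], token2id)
print(bow)  # [(0, 2), (1, 1)]
```

Tokens that were filtered out of the dictionary simply do not appear in the vector.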




In [109]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

Let's see how many tokens and documents we have to train on.




In [110]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 3819
Number of documents: 1219


In [111]:
# I wonder how that will affect things. Maybe limited tokens better for honing in on topic analysis. 
# same culling useful for embedding?

Training
--------

We are ready to train the LDA model. We will first discuss how to set some of
the training parameters.

First of all, the elephant in the room: how many topics do I need? There is
really no easy answer for this; it depends on both your data and your
application. I have used 10 topics here because I wanted to have a few topics
that I could interpret and "label", and because that turned out to give me
reasonably good results. You might not need to interpret all your topics, so
you could use a large number of topics, for example 100.

``chunksize`` controls how many documents are processed at a time in the
training algorithm. Increasing chunksize will speed up training, at least as
long as the chunk of documents easily fits into memory. I've set ``chunksize =
2000``, which is more than the number of documents, so I process all the
data in one go. Chunksize can, however, influence the quality of the model, as
discussed in Hoffman and co-authors [2], but the difference was not
substantial in this case.
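
A back-of-the-envelope calculation shows what "all the data in one go" means in terms of model updates. This sketch assumes single-process training with one model update per chunk; the variable names are illustrative, and the numbers are the ones used in this notebook.

```python
import math

num_docs = 1219    # documents in the corpus
chunksize = 2000   # documents per chunk
passes = 20        # full sweeps over the corpus

# Assumption: one model update per chunk.
updates_per_pass = math.ceil(num_docs / chunksize)
total_updates = updates_per_pass * passes

print(updates_per_pass, total_updates)  # 1 20
```

With a smaller chunk size, say 500, the same corpus would be split into three chunks per pass.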

``passes`` controls how often we train the model on the entire corpus.
Another word for passes might be "epochs". ``iterations`` is somewhat
technical, but essentially it controls how often we repeat a particular loop
over each document. It is important to set the number of "passes" and
"iterations" high enough.

I suggest the following way to choose iterations and passes. First, enable
logging (as described in many Gensim tutorials) and set ``eval_every = 1``
in ``LdaModel``. The line in question is emitted at DEBUG level, so the
logging level must be DEBUG rather than INFO for it to appear. When training
the model, look for a line in the log that looks something like this::

   2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations

If you set ``passes = 20`` you will see this line 20 times. Make sure that by
the final passes, most of the documents have converged. So you want to choose
both passes and iterations to be high enough for this to happen.
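
If you capture the training log, the convergence figures can be pulled out with a regular expression. The sketch below parses the example line quoted above; the pattern and variable names are my own, and the log format is assumed to match that example.

```python
import re

# Example line copied from the text above.
line = (
    "2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - "
    "68/1566 documents converged within 400 iterations"
)

pattern = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")
match = pattern.search(line)
converged, total, iterations = map(int, match.groups())
fraction = converged / total

print(f"{converged}/{total} converged ({fraction:.1%}) within {iterations} iterations")
```

A low fraction in the final passes suggests raising ``iterations``, ``passes``, or both.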

We set ``alpha = 'auto'`` and ``eta = 'auto'``. Again this is somewhat
technical, but essentially we are automatically learning two parameters in
the model that we usually would have to specify explicitly.




In [112]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

2020-04-15 08:08:53,795 : INFO : using autotuned alpha, starting with [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
2020-04-15 08:08:53,807 : INFO : using serial LDA version on this node
2020-04-15 08:08:53,820 : INFO : running online (multi-pass) LDA training, 10 topics, 20 passes over the supplied corpus of 1219 documents, updating model once every 1219 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2020-04-15 08:08:53,822 : INFO : PROGRESS: pass 0, at document #1219/1219
2020-04-15 08:09:08,289 : INFO : optimized alpha [0.04740873, 0.050656203, 0.06904058, 0.06628601, 0.06795424, 0.06668646, 0.07879429, 0.09886372, 0.055140518, 0.07347201]
2020-04-15 08:09:08,296 : INFO : topic #0 (0.047): 0.012*"student" + 0.008*"school" + 0.006*"research" + 0.005*"program" + 0.004*"engineering" + 0.004*"design" + 0.004*"high" + 0.004*"science" + 0.004*"academic" + 0.004*"marketing"
2020-04-15 08:09:08,298 : INFO : topic #1 (0.051): 

2020-04-15 08:09:32,804 : INFO : topic #0 (0.035): 0.016*"student" + 0.011*"school" + 0.010*"science" + 0.009*"engineering" + 0.008*"research" + 0.008*"program" + 0.007*"fall" + 0.006*"spring" + 0.006*"dental" + 0.006*"pa"
2020-04-15 08:09:32,805 : INFO : topic #5 (0.052): 0.013*"design" + 0.008*"system" + 0.006*"employee" + 0.006*"technology" + 0.005*"hr" + 0.005*"manager" + 0.005*"web" + 0.005*"training" + 0.005*"client" + 0.004*"designer"
2020-04-15 08:09:32,806 : INFO : topic #7 (0.067): 0.009*"student" + 0.006*"position" + 0.005*"may" + 0.005*"engineering" + 0.005*"school" + 0.005*"job" + 0.005*"science" + 0.004*"information" + 0.004*"use" + 0.004*"career"
2020-04-15 08:09:32,808 : INFO : topic #2 (0.075): 0.014*"business" + 0.009*"customer" + 0.009*"financial" + 0.009*"sale" + 0.007*"manager" + 0.007*"client" + 0.006*"account" + 0.006*"accounting" + 0.005*"system" + 0.005*"process"
2020-04-15 08:09:32,809 : INFO : topic diff=0.281732, rho=0.377964
2020-04-15 08:09:32,818 : INFO :

2020-04-15 08:09:50,849 : INFO : optimized alpha [0.033748176, 0.030254524, 0.08719951, 0.03239606, 0.040737182, 0.05186914, 0.04079399, 0.06098248, 0.046475705, 0.037928913]
2020-04-15 08:09:50,880 : INFO : topic #1 (0.030): 0.024*"health" + 0.017*"j" + 0.013*"research" + 0.013*"medical" + 0.013*"r" + 0.011*"care" + 0.011*"patient" + 0.010*"conference" + 0.009*"department" + 0.009*"hospital"
2020-04-15 08:09:50,883 : INFO : topic #3 (0.032): 0.016*"construction" + 0.013*"xexxf" + 0.009*"chicago" + 0.007*"legal" + 0.007*"engineering" + 0.006*"engineer" + 0.006*"il" + 0.006*"chicago_il" + 0.006*"equipment" + 0.005*"safety"
2020-04-15 08:09:50,887 : INFO : topic #5 (0.052): 0.014*"design" + 0.007*"hr" + 0.006*"employee" + 0.006*"system" + 0.006*"technology" + 0.005*"training" + 0.005*"web" + 0.005*"client" + 0.005*"designer" + 0.005*"manager"
2020-04-15 08:09:50,905 : INFO : topic #7 (0.061): 0.010*"student" + 0.007*"position" + 0.006*"may" + 0.006*"job" + 0.005*"school" + 0.005*"use" + 

2020-04-15 08:10:10,048 : INFO : topic #2 (0.096): 0.016*"business" + 0.010*"sale" + 0.010*"customer" + 0.009*"financial" + 0.009*"manager" + 0.007*"client" + 0.007*"account" + 0.006*"accounting" + 0.005*"training" + 0.005*"process"
2020-04-15 08:10:10,050 : INFO : topic diff=0.129853, rho=0.235702
2020-04-15 08:10:10,063 : INFO : PROGRESS: pass 17, at document #1219/1219
2020-04-15 08:10:12,894 : INFO : optimized alpha [0.034365945, 0.030192683, 0.09701942, 0.03277409, 0.041131556, 0.053586993, 0.03944895, 0.059586506, 0.049859904, 0.036506534]
2020-04-15 08:10:12,900 : INFO : topic #1 (0.030): 0.026*"health" + 0.017*"j" + 0.013*"research" + 0.013*"medical" + 0.013*"r" + 0.012*"care" + 0.011*"patient" + 0.010*"conference" + 0.009*"hospital" + 0.009*"department"
2020-04-15 08:10:12,902 : INFO : topic #3 (0.033): 0.018*"construction" + 0.014*"xexxf" + 0.009*"chicago" + 0.009*"engineering" + 0.008*"engineer" + 0.008*"equipment" + 0.008*"safety" + 0.006*"il" + 0.006*"chicago_il" + 0.006*"

We can compute the topic coherence of each topic. Below we display the
average topic coherence and print the topics in order of topic coherence.

Note that we use the "Umass" topic coherence measure here (see
:py:func:`gensim.models.ldamodel.LdaModel.top_topics`). Gensim also has an
implementation of the "AKSW" topic coherence measure (see the
accompanying blog post, http://rare-technologies.com/what-is-topic-coherence/).

If you are familiar with the subject matter of the documents in this dataset,
you can see that the topics below make a lot of sense. However, they are not
without flaws. We can see that there is substantial overlap between some topics,
others are hard to interpret, and most of them have at least some terms that
seem out of place. If you were able to do better, feel free to share your
methods on the blog (http://rare-technologies.com/lda-training-tips/).
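
The result of ``top_topics`` is a list of ``(topic_terms, coherence)`` pairs, where ``topic_terms`` is a list of ``(probability, word)`` pairs. The sketch below works on a small hand-written list of that shape, with made-up numbers, to show how the average coherence and a short label per topic can be derived.

```python
# Hypothetical data in the same shape as the output of top_topics.
toy_top_topics = [
    ([(0.02, "system"), (0.01, "software")], -0.9),
    ([(0.01, "student"), (0.007, "position")], -1.3),
]

# Average the per-topic coherence scores.
coherences = [coherence for _, coherence in toy_top_topics]
avg_coherence = sum(coherences) / len(coherences)

# Build a compact label from each topic's words.
labels = [" ".join(word for _, word in terms) for terms, _ in toy_top_topics]

print("Average topic coherence: %.4f" % avg_coherence)
print(labels)
```

Dividing by ``len(coherences)`` rather than a separate topic count keeps the average correct even if the list is filtered.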




In [119]:
top_topics = model.top_topics(corpus) #, num_words=20)


# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
print('\ntop topics:\n')
pprint(top_topics)

2020-04-15 08:14:07,589 : INFO : CorpusAccumulator accumulated stats from 1000 documents


Average topic coherence: -1.1315.

top topics:

[([(0.024670428, 'system'),
   (0.013092519, 'software'),
   (0.01171928, 'application'),
   (0.011514891, 'design'),
   (0.00971705, 'data'),
   (0.009472821, 'web'),
   (0.0081805475, 'using'),
   (0.007951193, 'technology'),
   (0.007663341, 'c'),
   (0.007572099, 'business'),
   (0.0074960017, 'computer'),
   (0.006469897, 'user'),
   (0.006010863, 'technical'),
   (0.005859585, 'test'),
   (0.005306984, 'developed'),
   (0.0052748295, 'programming'),
   (0.005208697, 'analysis'),
   (0.0051359185, 'engineering'),
   (0.0050469283, 'requirement'),
   (0.005031191, 'window')],
  -0.8586092453651734),
 ([(0.009472255, 'student'),
   (0.007460938, 'position'),
   (0.006149306, 'job'),
   (0.005903394, 'may'),
   (0.0053736507, 'use'),
   (0.0053210435, 'school'),
   (0.00479321, 'career'),
   (0.004692284, 'information'),
   (0.004588054, 'letter'),
   (0.004531643, 'list'),
   (0.0045027323, 'name'),
   (0.004373066, 'include'),
   (0.0

In [121]:
pprint(model.print_topics())
doc_lda = model[corpus]

2020-04-15 08:18:38,753 : INFO : topic #0 (0.035): 0.018*"student" + 0.015*"engineering" + 0.014*"science" + 0.009*"school" + 0.008*"research" + 0.007*"fall" + 0.007*"society" + 0.007*"program" + 0.007*"may" + 0.007*"pa"
2020-04-15 08:18:38,757 : INFO : topic #1 (0.030): 0.027*"health" + 0.017*"j" + 0.013*"research" + 0.013*"medical" + 0.013*"r" + 0.012*"care" + 0.011*"patient" + 0.010*"conference" + 0.009*"hospital" + 0.009*"department"
2020-04-15 08:18:38,760 : INFO : topic #2 (0.100): 0.016*"business" + 0.011*"sale" + 0.010*"customer" + 0.009*"financial" + 0.009*"manager" + 0.007*"client" + 0.007*"account" + 0.006*"accounting" + 0.005*"training" + 0.005*"process"
2020-04-15 08:18:38,761 : INFO : topic #3 (0.033): 0.018*"construction" + 0.014*"xexxf" + 0.009*"engineering" + 0.009*"chicago" + 0.009*"engineer" + 0.008*"equipment" + 0.008*"safety" + 0.006*"chicago_il" + 0.006*"il" + 0.006*"design"
2020-04-15 08:18:38,763 : INFO : topic #4 (0.041): 0.025*"system" + 0.013*"software" + 0.0

[(0,
  '0.018*"student" + 0.015*"engineering" + 0.014*"science" + 0.009*"school" + '
  '0.008*"research" + 0.007*"fall" + 0.007*"society" + 0.007*"program" + '
  '0.007*"may" + 0.007*"pa"'),
 (1,
  '0.027*"health" + 0.017*"j" + 0.013*"research" + 0.013*"medical" + 0.013*"r" '
  '+ 0.012*"care" + 0.011*"patient" + 0.010*"conference" + 0.009*"hospital" + '
  '0.009*"department"'),
 (2,
  '0.016*"business" + 0.011*"sale" + 0.010*"customer" + 0.009*"financial" + '
  '0.009*"manager" + 0.007*"client" + 0.007*"account" + 0.006*"accounting" + '
  '0.005*"training" + 0.005*"process"'),
 (3,
  '0.018*"construction" + 0.014*"xexxf" + 0.009*"engineering" + '
  '0.009*"chicago" + 0.009*"engineer" + 0.008*"equipment" + 0.008*"safety" + '
  '0.006*"chicago_il" + 0.006*"il" + 0.006*"design"'),
 (4,
  '0.025*"system" + 0.013*"software" + 0.012*"application" + 0.012*"design" + '
  '0.010*"data" + 0.009*"web" + 0.008*"using" + 0.008*"technology" + 0.008*"c" '
  '+ 0.008*"business"'),
 (5,
  '0.015*"desi

In [125]:

# Compute Perplexity
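# log_perplexity returns the per-word likelihood bound, not the perplexity itself.
# A higher (less negative) bound is better; perplexity = 2 ** (-bound), where lower is better.
print('\nPerplexity: ', model.log_perplexity(corpus))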


2020-04-15 08:21:28,041 : INFO : -7.281 per-word bound, 155.5 perplexity estimate based on a held-out corpus of 1219 documents with 594829 words



Perplexity:  -7.28111608707997


In [126]:
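# The original run passed texts=corpus (bag-of-words vectors), which is the
# most likely cause of the nan shown in the output below. The 'c_v' measure
# needs the tokenized documents (lists of strings), so pass texts=docs instead.

# Compute Coherence Score
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=model, texts=docs, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)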

2020-04-15 08:21:28,060 : INFO : using ParallelWordOccurrenceAccumulator(processes=3, batch_size=64) to estimate probabilities from sliding windows
2020-04-15 08:21:28,183 : INFO : serializing accumulator to return to master...
2020-04-15 08:21:28,183 : INFO : serializing accumulator to return to master...
2020-04-15 08:21:28,182 : INFO : serializing accumulator to return to master...
2020-04-15 08:21:28,191 : INFO : accumulator serialized
2020-04-15 08:21:28,192 : INFO : accumulator serialized
2020-04-15 08:21:28,191 : INFO : accumulator serialized
2020-04-15 08:21:28,248 : INFO : 3 accumulators retrieved from output queue
2020-04-15 08:21:28,272 : INFO : accumulated word occurrence stats for 0 virtual documents



Coherence Score:  nan
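
The 'c_v' coherence is estimated from sliding windows over tokenized text, so ``texts`` should hold lists of string tokens rather than bag-of-words tuples; passing the latter is a likely reason for the nan above. A small guard can catch the mismatch before the coherence model is built. The ``is_tokenized`` helper and the toy data are hypothetical and not part of Gensim.

```python
def is_tokenized(texts):
    """Return True if every document is a sequence of string tokens."""
    return all(isinstance(token, str) for doc in texts for token in doc)


toy_docs = [["hr", "manager", "hr"], ["software", "engineer"]]
token2id = {"engineer": 0, "hr": 1, "manager": 2, "software": 3}

# Bag-of-words form: sorted (token_id, count) pairs per document.
toy_corpus = [
    sorted({token2id[t]: doc.count(t) for t in doc}.items())
    for doc in toy_docs
]

print(is_tokenized(toy_docs))    # True
print(is_tokenized(toy_corpus))  # False
```

In this notebook the tokenized documents live in ``docs``, while ``corpus`` holds the bag-of-words vectors.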


In [129]:
import pyLDAvis
import pyLDAvis.gensim

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model, corpus, dictionary)
vis

2020-04-15 08:24:02,636 : INFO : NumExpr defaulting to 4 threads.


Things to experiment with
-------------------------

* ``no_above`` and ``no_below`` parameters in ``filter_extremes`` method.
* Adding trigrams or even higher order n-grams.
* Consider whether using a hold-out set or cross-validation is the way to go for you.
* Try other datasets.

Where to go from here
---------------------

* Check out a RaRe blog post on the AKSW topic coherence measure (http://rare-technologies.com/what-is-topic-coherence/).
* pyLDAvis (https://pyldavis.readthedocs.io/en/latest/index.html).
* Read some more Gensim tutorials (https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).
* If you haven't already, read [1] and [2] (see references).

References
----------

1. "Latent Dirichlet Allocation", Blei et al. 2003.
2. "Online Learning for Latent Dirichlet Allocation", Hoffman et al. 2010.


