# Topic Modeling ##


## What is Topic Modeling? ##

What is topic modeling? At its most basic level, topic modeling is an automated method for extracting the themes, or "topics," from large sets of documents--like GPT-3 generated fiction, or Yelp reviews, or as we'll explore today, articles in the Emory Wheel.

There are numerous kinds of topic models, but the most popular and widely-used kind is latent Dirichlet allocation (LDA). It's so popular, in fact, that "LDA" and "topic model" are sometimes used interchangeably, even though LDA is only one type.

LDA math is pretty complicated. We're not going to get very deep into the math just yet (or maybe not ever, depending on the time). But first we are going to introduce two important concepts that will help us conceptually understand how LDA topic models work.

### 1) LDA is an Unsupervised Algorithm 
Topic modeling is a kind of machine learning. Machine learning always sounds complicated, but it really just means that computer algorithms are performing tasks without being explicitly programmed to do so and that they are "learning" how to perform these tasks by being fed training data. In the field of machine learning, algorithms are typically split into two broad categories: supervised and unsupervised. These categories describe how the algorithms are "trained" or how they "learn." LDA is an unsupervised algorithm.

If an algorithm is supervised, that means a researcher is helping to guide it with some kind of information, like labels. For example, if you wanted to create an algorithm that could identify pictures of cats vs pictures of dogs, you could train it with a bunch of pictures of cats that were clearly labeled CAT and a bunch of pictures of dogs that were clearly labeled DOG. The algorithm would then be able to learn which features are specific to cats vs dogs because you explicitly told it: this is a picture of a cat; this is a picture of a dog.

If an algorithm is unsupervised, that means a researcher does not train it with outside information. There are no labels. The algorithm just learns that pictures of cats are more similar to each other and pictures of dogs are more similar to each other. The algorithm doesn't really know that one cluster is cats and one cluster is dogs; it just knows that there are two distinct clusters.

Because LDA is an unsupervised algorithm, we don't tell our topic model which words or topics to look for. We only tell the topic model how many topics (or clusters of words) we want returned. The topic model doesn't know anything about Frida Kahlo, Nella Larsen, and Jackie Robinson. It doesn't know anything about art, literature, and sports.

### 2) LDA is a Probabilistic Model 
LDA fundamentally relies on statistics and probabilities. Rather than calculating precise and unchanging metrics about a given corpus, as we've done thus far, a topic model makes a series of very sophisticated guesses about the corpus. These guesses will change slightly every time we run the topic model. This is important to remember as we analyze, interpret, and make arguments based on our results. All of our results in this lesson will be probabilities, and they'll change slightly every time we re-run the topic model.

When we tell the topic model that we want to extract 15 topics from the Emory Wheel, here's what the topic model does:

The topic model starts off with a slightly silly, backwards assumption. The topic model assumes that every single one of the 4000-some-odd articles in the corpus was written by someone who exclusively drew their words from 15 mystery topics, or 15 clusters of words. To spin it in a slightly different way with a different medium, the topic model assumes that there was one master artist with 15 different paints on her palette, who created all the articles by dipping her brush into these 15 paints alone, applying and blending them onto each canvas in different proportions. The topic model is trying to discover the 15 mystery topics that created all the Wheel articles, as well as the mixture of these topics that makes up each individual article.

The topic model begins by taking a completely wild guess about the 15 topics, but then it iterates through all the words in all the articles and makes better and better guesses. If the word "student" keeps showing up with the words "stress" and "exam," and if all three words keep showing up in the same kinds of articles, then the topic model starts to suspect that these three words should belong to the same topic. If the word "film" keeps showing up with "Atlanta" and "industry," then the topic model suspects that they should belong to the same topic, too. The topic model finally arrives at its best guesses for the 15 topics that most likely created all the Emory Wheel articles.


## LDA explained again in more abstract terms

Probabilistic topic models begin with an assumption and a definition. 

The assumption: all documents contain a mixture of different topics.

The definition: a topic is a collection of words, each with a different probability of occurrence in a particular document (or other chunk of text) discussing that topic. 




Here's a nice illustration, created by Ted Underwood, that shows this assumed relationship between topics and documents. 

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

Above we see an example of the basic assumption of topic modeling: one topic might contain many occurrences of “organize,” “committee,” “direct,” and “lead.” Another might contain a lot of “mercury” and “arsenic,” with a few occurrences of “lead.” 

The three documents are assumed to contain both topics in different proportions.
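To make the assumption concrete, here is a minimal sketch of the "backwards" story a topic model tells itself: documents are produced by repeatedly picking a topic according to the document's mixture, then picking a word from that topic. The two topics, their word probabilities, and the mixtures are all made-up values for illustration; nothing here comes from the Wheel corpus or from gensim.

```python
import random

# Hypothetical topics: each maps words to a probability of being drawn.
topics = {
    "leadership": {"organize": 0.3, "committee": 0.3, "direct": 0.2, "lead": 0.2},
    "heavy_metals": {"mercury": 0.45, "arsenic": 0.45, "lead": 0.1},
}

def generate_document(topic_mixture, length, rng):
    """Draw `length` words: pick a topic by the document's mixture,
    then pick a word according to that topic's word probabilities."""
    topic_names = list(topic_mixture)
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names,
                            weights=[topic_mixture[t] for t in topic_names])[0]
        vocab = list(topics[topic])
        word = rng.choices(vocab, weights=[topics[topic][w] for w in vocab])[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two documents built from the same topics in different proportions
doc_a = generate_document({"leadership": 0.9, "heavy_metals": 0.1}, 12, rng)
doc_b = generate_document({"leadership": 0.2, "heavy_metals": 0.8}, 12, rng)

print(doc_a)
print(doc_b)
```

Note that "lead" can come out of either topic, which is exactly the ambiguity the model has to resolve when it works in the other direction.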

But here is the thing: we can’t directly observe topics. All we actually have are the documents that attest to their existence. So in other words:

**Topic modeling is a way of extrapolating backward from a collection of documents to infer the topics that could have generated them.** 

There is simply no way to infer the exact topics in a set of documents; there are too many unknowns. So (probabilistic) topic modeling works backwards. It pretends that the problem is mostly solved. 

**How does this play out in actual life?**

Suppose we knew which topic produced every word in the collection, except for this one word in document D. The word happens to be “lead,” which we’ll call word type W. How are we going to decide whether this occurrence of W belongs to topic 1 or topic 2?

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

We can’t know for sure. But one way to guess is to consider two questions. This is the first: 

* How often does “lead” appear in topic 1 elsewhere? If “lead” often occurs in discussions of 1, then this instance of “lead” might belong to 1 as well. 

But a word can be common in more than one topic, as it is in topics 1 and 2 above. And we don’t want to assign “lead” to a topic about leadership (topic 1) if this document is mostly about heavy metal contamination (topic 2). So we also need to consider a second question:

* How common is topic 1 in the rest of the document?

To answer these questions, here’s what we’ll do:

For each possible topic Z, we’ll multiply the frequency of this word type W in Z by the number of other words in document D that already belong to Z. The result will represent the probability that this word came from Z. Here’s the actual formula:

![LDA formula](https://tedunderwood.files.wordpress.com/2012/04/ldaformula.png)

There are also a few Greek letters scattered in there, but they aren’t important for our purposes. Technically, they’re called “hyperparameters,” but you can think of them simply as fudge factors. 

In other words: there’s some chance that this word belongs to topic Z even if it is nowhere else associated with Z; the fudge factors keep that possibility open. (If you want to understand hyperparameters beyond the "fudge factor" explanation, see "[Rethinking LDA: Why Priors Matter](http://dirichlet.net/pdf/wallach09rethinking.pdf).")

The overall emphasis on probability in this technique, of course, is why it’s called *probabilistic topic modeling*.
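The multiplication described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are invented, the function name `topic_scores` is hypothetical, and the fudge factors are given arbitrary small values (`alpha`, `beta`), with the denominator's fudge factor taken to be `beta` times the vocabulary size.

```python
def topic_scores(word_in_topic, topic_totals, doc_topic_counts,
                 vocab_size, alpha=0.1, beta=0.01):
    """For each topic Z, multiply
       (how often word W appears in Z, as a share of all words in Z)
       by (how many other words in document D already belong to Z),
    then normalize so the scores sum to 1."""
    scores = {}
    for z in topic_totals:
        word_weight = (word_in_topic[z] + beta) / (topic_totals[z] + vocab_size * beta)
        doc_weight = doc_topic_counts[z] + alpha
        scores[z] = word_weight * doc_weight
    total = sum(scores.values())
    return {z: score / total for z, score in scores.items()}

# Made-up counts: "lead" is more common in topic 1 (leadership) overall...
lead_in_topic = {1: 40, 2: 15}
topic_totals = {1: 1000, 2: 1000}
# ...but this document is mostly about topic 2 (heavy metal contamination)
doc_topic_counts = {1: 3, 2: 27}

probs = topic_scores(lead_in_topic, topic_totals, doc_topic_counts, vocab_size=500)
print(probs)
```

With these invented numbers, topic 2 gets the higher score even though "lead" is more frequent in topic 1, because the second question (how common is the topic in the rest of the document?) outweighs the first.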

### Enter Sampling ###

Now, suppose that instead of having the problem mostly solved, we had only a wild guess which word belonged to which topic. We could still use the strategy I've just described to improve our guess, by making it more internally consistent. 

We could go through the collection, word by word, and reassign each word to a topic, guided by the formula above. 

And in fact, that's what LDA actually does.

And as we do that, two things happen:

1) Words will gradually become more common in topics where they are already common. And also,

2) Topics will become more common in documents where they are already common. 

Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. So the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.
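The loop just described (start with random assignments, then revisit every word and reassign it using the formula) can be sketched as a toy sampler. Treat this as a teaching sketch, not as what gensim runs: gensim's `LdaModel` uses a different inference procedure (online variational Bayes), and the function name `gibbs_lda`, the tiny documents, and the values of `alpha` and `beta` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, num_topics, vocab_size, passes=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    # Wild guess: a random topic for every word occurrence
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    word_topic = defaultdict(int)                      # (word, topic) -> count
    topic_total = [0] * num_topics                     # words assigned to each topic
    doc_topic = [[0] * num_topics for _ in docs]       # per-document topic counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assignments[d][i]
            word_topic[(w, z)] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1

    for _ in range(passes):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Pretend we know every assignment except this one
                z = assignments[d][i]
                word_topic[(w, z)] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1

                # Score each topic: word's share of the topic times
                # the topic's presence in the rest of the document
                weights = [
                    (word_topic[(w, k)] + beta) / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                # Reassign the word by sampling in proportion to the scores
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                word_topic[(w, z)] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1

    return assignments, doc_topic

docs = [
    ["lead", "committee", "organize", "direct", "lead"],
    ["mercury", "arsenic", "lead", "mercury"],
    ["organize", "committee", "arsenic", "mercury", "lead"],
]
vocab_size = len({w for doc in docs for w in doc})
assignments, doc_topic = gibbs_lda(docs, num_topics=2, vocab_size=vocab_size)
print(doc_topic)  # how many words of each document ended up in each topic
```

Because words are reassigned by sampling rather than by always taking the top score, changing `seed` will generally change the result, which mirrors the run-to-run variation mentioned earlier.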

For a slightly more in-depth explanation of how LDA works, see [this video](https://vimeo.com/53080123). (Start around 5:35.) 

### A brief historical / technical digression... ###

Topic modeling began as a US military project in the 1990s. The goal was to automatically detect changes in newswire text so that governmental and military organizations could be alerted to emerging geopolitical events. (For more on this history, see [Binder](https://dhdebates.gc.cuny.edu/read/untitled/section/4b276a04-c110-4cba-b93d-4ded8fcfafc9#ch18).)


In the early 2000s, a team of computer science researchers released [MALLET](http://mallet.cs.umass.edu/topics.php), short for **MA**chine **L**earning for **L**anguag**E** **T**oolkit. As the name suggests, MALLET is a software toolkit that enables a range of NLP techniques. Today, people mostly only use it for topic modeling, which it remains very very good at.

With that said, MALLET is written in Java, which means that it's not ideal for working in Python and Colab notebooks. Maria Antoniak, whose "Birth Stories" paper we've read, has written a convenient Python package that allows you to use MALLET in a Colab notebook. Her package is called [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), and you should think about using that if you end up using topic modeling as one of your methods for your final project.

For today, though, we'll be using the slightly less accurate but decidedly easier to install [gensim](https://radimrehurek.com/gensim/about.html), a native Python library for topic modeling that was created in the early 2010s by a computer science PhD student, Radim Rehurek. Its ease of use has made the use of topic models explode--although, I should reiterate, most people end up returning to MALLET for research-level code. 

## Let's go! ##

In [8]:
# import and setup modules we'll be using in this notebook
import logging # for logging status etc
import itertools # helpful library for iterating through things

import numpy as np # this is a powerful python math package that many others are based on
import gensim # our topic modeling library
import os # for file i/o

# configure logging, since topic modeling takes a while and it's good to know what's going on 
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  

# a helpful function that returns the first `n` elements of the stream as plain list.
# we'll use this later
def head(stream, n=10):
    return list(itertools.islice(stream, n))

In [9]:
# import some more gensim modules for processing the corpus
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

## Load in our data

As in the TF-IDF notebooks, we'll be using the Emory Wheel corpus. The documents are individual .txt files that are stored in a zip file on my Google Drive. Below is code similar--but not identical--to the previous notebook, which gets the data from Google Drive and unzips it into a folder, which is further processed below.

In [10]:
# For downloading large files from Google Drive
# https://github.com/wkentaro/gdown
import gdown

# then download the zip file
gdown.download('https://drive.google.com/uc?export=download&id=1SUWUVswaY_RDLhzFruQIDJe-i6I3gznC', quiet=False) 

# unzip it 
!unzip wheel-clean.zip

Downloading...
From: https://drive.google.com/uc?export=download&id=1SUWUVswaY_RDLhzFruQIDJe-i6I3gznC
To: /content/wheel-clean.zip
100%|██████████| 9.36M/9.36M [00:00<00:00, 71.9MB/s]


Archive:  wheel-clean.zip
   creating: wheel-clean/
  inflating: wheel-clean/2017-09-06-White-Supremacist-Symbol-Tarnishes-Old-DeKalb-Courthouse.txt  
  inflating: wheel-clean/2017-09-13-Let-the-Games-Begin-Taylor-Swift-Satirizes-her-Reputation-in-NewSingles.txt  
  inflating: wheel-clean/2014-11-11-Crime-Report-11-11-14.txt  
  inflating: wheel-clean/2017-10-18-Doolino-Knows-Best-Falling-Apart.txt  
  inflating: wheel-clean/2016-11-23-Former-U-S-Poet-Laureate-to-Leave-Emory-for-Northwestern.txt  
  inflating: wheel-clean/2019-09-14-DeKalb-County-Issues-Boil-Water-Advisory.txt  
  inflating: wheel-clean/2015-10-24--Bob-s-Burgers-Going-Steady-Into-Season-Six.txt  
  inflating: wheel-clean/2016-03-21-The-Best-Batman-Film-is-One-You-Have-Probably-Never-Seen.txt  
  inflating: wheel-clean/2018-10-10-Great-Times-Await-at-Bad-Times-at-the-El-Royale-.txt  
  inflating: wheel-clean/2015-10-01-College-Is-Worth-More-Than-Advertised.txt  
  inflating: wheel-clean/2017-09-06-Emory-Doctors-Aid-Refu

## Processing the data 

As I've emphasized throughout this course, there is always some sort of pre-processing involved in running any text analysis method. However, the particulars of this step are always a little different. 

To generate a topic model using gensim, each document needs to be fed into the model in the format (title, tokens). 

In the next few cells, we'll write a function to tokenize our documents, and then we'll use that function to pre-process our documents. 

### Step 1: Define our tokenizing function ###

As previously discussed, gensim requires that we first tokenize our corpus. 

We've used tokenizers built into spaCy already for this course. Here, however, we're going to write our own quick tokenizing function that makes use of gensim's [simple_preprocess function](https://radimrehurek.com/gensim/utils.html), which breaks a document into a list of lowercase tokens. The lower-casing is important for topic modeling since we want both uppercase and lowercase versions of the same word to be counted together. 

In [11]:
# here's some nice dense python for you:
# this defines our tokenize function for future use
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]
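If you want a feel for what this function returns without loading gensim, here is a rough standard-library approximation. It is not gensim's implementation: `rough_tokenize` and `SMALL_STOPWORDS` are hypothetical names, the stop word list is a tiny stand-in for gensim's much longer `STOPWORDS`, and the regular expression only handles unaccented letters. The length limits mirror the defaults of `simple_preprocess` (tokens of 2 to 15 characters).

```python
import re

# A tiny, made-up stop word list; gensim's STOPWORDS is far longer
SMALL_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "at"}

def rough_tokenize(text, min_len=2, max_len=15):
    # lowercase everything, then keep runs of letters only
    words = re.findall(r"[a-z]+", text.lower())
    # drop very short or very long tokens, and drop stop words
    return [w for w in words
            if min_len <= len(w) <= max_len and w not in SMALL_STOPWORDS]

tokens = rough_tokenize("The Eagles competed at the UAA Championships in Chicago.")
print(tokens)
# ['eagles', 'competed', 'uaa', 'championships', 'chicago']
```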

### Step 2: Process the docs ###

Now that we have our tokenizing function, we can use it in the function below. This one iterates through our corpus and returns each document in the format (title, tokens), as required. 

**Side note: we've not yet discussed tuples.** Tuples exist in many programming languages. For our purposes, just know that tuples are sequences of objects, just like lists, but they cannot be changed once created. In Python, you indicate a tuple with parentheses. 
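Here is a quick illustration of a tuple in the (title, tokens) shape we need. The file name is one from the corpus listing above; the two tokens are made up for the example.

```python
# a tuple pairing a title with its tokens (tokens are invented for illustration)
doc = ("2014-11-11-Crime-Report-11-11-14.txt", ["crime", "report"])

# tuples can be indexed and unpacked just like lists
title, tokens = doc
print(title)
print(len(doc))

# ...but trying to change one raises a TypeError
try:
    doc[0] = "renamed.txt"
    changed = True
except TypeError:
    changed = False

print(changed)
```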

In any case, the gensim documentation tells us that we want to define a pre-processing function like this:

In [2]:
# A function to yield each doc in a base directory as a `(filename, tokens)` tuple.

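def iter_docs(base_dir):
    docs = os.listdir(base_dir)

    for doc in docs:
        # skip hidden files (names that start with a dot)
        if not doc.startswith('.'):
            # os.path.join works whether or not base_dir ends with a slash
            with open(os.path.join(base_dir, doc), "r") as file:
                text = file.read()
                tokens = tokenize(text) 
        
                yield doc, tokens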

The next step is to create a Dictionary (not to be confused with a Python dictionary) which maps each word to a numerical ID. 

This mapping step is required because most algorithms, including gensim's implementation of LDA, rely on numerical libraries that work with vectors indexed by integers, not by strings. Also, many functions need to know the vector/matrix dimensionality in advance.
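As a rough picture of what such a mapping looks like, here is a minimal sketch in plain Python. It is not how gensim builds its Dictionary internally (for one thing, the IDs gensim assigns may come out in a different order); the two tiny documents are invented.

```python
# two made-up tokenized documents
docs = [["coffee", "menu", "coffee"], ["film", "atlanta", "coffee"]]

# give every new word the next unused integer
token2id = {}
for tokens in docs:
    for token in tokens:
        if token not in token2id:
            token2id[token] = len(token2id)

# the reverse lookup, from ID back to word
id2token = {i: t for t, i in token2id.items()}

print(token2id)
print(len(token2id))  # the vocabulary size, i.e. the vector dimensionality
```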

The mapping can be constructed automatically by giving gensim's Dictionary class a stream of tokenized documents, like so:

In [7]:
# set up the stream 
# this is the one line you'd change here with another corpus and/or corpus location
stream = iter_docs('wheel-clean/')

# all of the rest is standard from the gensim documentation
doc_stream = (tokens for _, tokens in stream)
              
id2word_wheel = gensim.corpora.Dictionary(doc_stream) 

print(id2word_wheel)



The Dictionary (id2word_wheel) now contains all words that appeared in the corpus, along with how many times they appeared. 

gensim provides a handy attribute, `token2id`, for mapping tokens to their ID numbers, not unlike the vocabulary of a scikit-learn vectorizer:

In [None]:
id2word_wheel.token2id

{'abigail': 0,
 'according': 1,
 'activities': 2,
 'added': 3,
 'aiello': 4,
 'aiming': 5,
 'ala': 6,
 'alexandra': 7,
 'allbacking': 8,
 'andhammer': 9,
 'andrew': 10,
 'april': 11,
 'association': 12,
 'athletes': 13,
 'athletic': 14,
 'athleticssenior': 15,
 'athleticsthis': 16,
 'balance': 17,
 'bar': 18,
 'bassen': 19,
 'begun': 20,
 'bend': 21,
 'benjamin': 22,
 'best': 23,
 'birmingham': 24,
 'bonding': 25,
 'brandon': 26,
 'breaking': 27,
 'caitlin': 28,
 'career': 29,
 'carnegiemellon': 30,
 'center': 31,
 'championships': 32,
 'chance': 33,
 'charlie': 34,
 'charlotte': 35,
 'cheeseboro': 36,
 'chicago': 37,
 'clearing': 38,
 'closely': 39,
 'coach': 40,
 'college': 41,
 'come': 42,
 'compete': 43,
 'competed': 44,
 'competing': 45,
 'competition': 46,
 'consisting': 47,
 'couple': 48,
 'courtesy': 49,
 'crane': 50,
 'cromer': 51,
 'curtin': 52,
 'dara': 53,
 'dash': 54,
 'days': 55,
 'delaney': 56,
 'discus': 57,
 'discusthrow': 58,
 'distance': 59,
 'eagle': 60,
 'eagles': 

There aren't many things you need to do in order to tune your topic model, but one important thing to consider is whether you should filter the words. 

gensim also provides functions for this:

In [None]:
# This line below would, for example, filter out the 50 most frequent words.
# It's commented out here because I don't want to use it in this case,
# but it's handy to know about. 
# id2word_wheel.filter_n_most_frequent(50)

# this line filters out words that appear in only 1 doc, keeping the rest
# I will use this one 
# Note how no_below and no_above take different data types: no_below is an absolute
# number of documents (an int), while no_above is a fraction of the corpus (a float)
id2word_wheel.filter_extremes(no_below=2, no_above=1.0)

id2word_wheel

INFO:gensim.corpora.dictionary:discarding 52015 tokens: [('allbacking', 1), ('andhammer', 1), ('athleticssenior', 1), ('discusthrow', 1), ('fischerplaced', 1), ('freshmangreg', 1), ('jordanflowers', 1), ('junioradam', 1), ('meterdash', 1), ('meterdistance', 1)]...
INFO:gensim.corpora.dictionary:keeping 34847 tokens which were in no less than 2 and no more than 4008 (=100.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(34847 unique tokens: ['abigail', 'according', 'activities', 'added', 'aiello']...)


<gensim.corpora.dictionary.Dictionary at 0x7f736b87df90>

Note that by removing the words that only appeared in a single document, we went from 86,862 unique words (or tokens) to 34,847. That's not a huge number for a topic model, but we'll see how it goes... 
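The idea behind this kind of filtering can be sketched without gensim. The function below is a hypothetical stand-in, not gensim's code: it counts how many documents each word appears in, then keeps words that appear in at least `no_below` documents and in no more than a `no_above` fraction of them. The three small documents are invented.

```python
from collections import Counter

def keep_by_doc_freq(docs, no_below=2, no_above=1.0):
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    max_docs = no_above * len(docs)   # turn the fraction into a document count
    return {t for t, n in doc_freq.items() if no_below <= n <= max_docs}

docs = [
    ["emory", "coffee", "coffee", "discusthrow"],
    ["emory", "film"],
    ["emory", "coffee", "film"],
]

# same settings as above: drop words found in only one document
kept = keep_by_doc_freq(docs, no_below=2, no_above=1.0)
print(sorted(kept))
# ['coffee', 'emory', 'film']

# a stricter upper bound also drops words found in nearly every document
kept_strict = keep_by_doc_freq(docs, no_below=2, no_above=0.9)
print(sorted(kept_strict))
# ['coffee', 'film']
```

Note that "coffee" appears twice in the first document but is still counted once for it; the filter looks at document counts, not raw word counts.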

Since a streamed corpus and a dictionary are all we need to create the vectors for our topic model, we can get started. 
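Each vector is a "bag of words": a list of (word ID, count) pairs for one document. Here is a minimal sketch of that conversion in plain Python, assuming a small made-up mapping; `to_bow` is a hypothetical helper that imitates the general shape of the output, not gensim's own implementation.

```python
from collections import Counter

# a made-up mapping from words to IDs
token2id = {"coffee": 0, "menu": 1, "film": 2}

def to_bow(tokens, token2id):
    # count only words that have an ID; unknown words are skipped
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    # return (word_id, count) pairs sorted by ID
    return sorted(counts.items())

bow = to_bow(["coffee", "menu", "coffee", "tequila"], token2id)
print(bow)
# [(0, 2), (1, 1)]
```

Word order is discarded entirely, and "tequila" disappears because it has no ID in this mapping.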

In [None]:
# a class we need; this is the same for every topic model you create with gensim. 
# no need to modify it here

class Corpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs
    
    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_docs(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)
    
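    def __len__(self):
        # clip_docs defaults to None, which len() cannot return; in that case
        # count the same non-hidden files that iter_docs() reads
        if self.clip_docs is not None:
            return self.clip_docs
        return sum(1 for name in os.listdir(self.dump_file) if not name.startswith('.'))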

In [None]:
# create a stream of bag-of-words vectors
# here's another place where you'd change the name/location of the corpus if you want to
# run a topic model on something else
wheel_corpus = Corpus('wheel-clean/', id2word_wheel)

# print the first vector in the stream to see what it looks like; 
# this is in the format (word_id, count in first doc)

vector = next(iter(wheel_corpus))

vector  

[(0, 1),
 (1, 2),
 (2, 1),
 (3, 4),
 (4, 2),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 2),
 (10, 3),
 (11, 3),
 (12, 3),
 (13, 1),
 (14, 1),
 (15, 2),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 7),
 (21, 2),
 (22, 1),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 2),
 (27, 1),
 (28, 1),
 (29, 5),
 (30, 1),
 (31, 1),
 (32, 1),
 (33, 1),
 (34, 1),
 (35, 2),
 (36, 1),
 (37, 1),
 (38, 1),
 (39, 2),
 (40, 3),
 (41, 1),
 (42, 1),
 (43, 3),
 (44, 1),
 (45, 1),
 (46, 3),
 (47, 1),
 (48, 1),
 (49, 1),
 (50, 1),
 (51, 12),
 (52, 1),
 (53, 2),
 (54, 3),
 (55, 4),
 (56, 1),
 (57, 4),
 (58, 1),
 (59, 1),
 (60, 1),
 (61, 1),
 (62, 2),
 (63, 10),
 (64, 1),
 (65, 1),
 (66, 1),
 (67, 7),
 (68, 7),
 (69, 1),
 (70, 1),
 (71, 1),
 (72, 1),
 (73, 1),
 (74, 3),
 (75, 4),
 (76, 1),
 (77, 1),
 (78, 4),
 (79, 4),
 (80, 4),
 (81, 1),
 (82, 1),
 (83, 1),
 (84, 2),
 (85, 1),
 (86, 9),
 (87, 10),
 (88, 1),
 (89, 1),
 (90, 4),
 (91, 1),
 (92, 1),
 (93, 1),
 (94, 1),
 (95, 2),
 (96, 2),
 (97, 1),
 (98, 1),
 (99, 1),
 (100, 

In [None]:
# now we're ready to run our topic model!

%time lda_model = gensim.models.LdaModel(wheel_corpus, num_topics=20, id2word=id2word_wheel, passes=5) 

# note that passes should be higher -- usually in the 50-100 range -- 
# but in the interests of time we'll only do 5 


INFO:gensim.models.ldamodel:using symmetric alpha at 0.05
INFO:gensim.models.ldamodel:using symmetric eta at 0.05
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 20 topics, 5 passes over the supplied corpus of 4008 documents, updating model once every 2000 documents, evaluating perplexity every 4008 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #2000/4008
INFO:gensim.models.ldamodel:merging changes from 2000 documents into a model of 4008 documents
INFO:gensim.models.ldamodel:topic #6 (0.050): 0.007*"emory" + 0.006*"said" + 0.004*"time" + 0.004*"film" + 0.003*"like" + 0.003*"trump" + 0.003*"atlanta" + 0.003*"students" + 0.003*"year" + 0.003*"student"
INFO:gensim.models.ldamodel:topic #16 (0.050): 0.007*"said" + 0.006*"emory" + 0.005*"film" + 0.005*"student" + 0.004*"students" + 0.004*"like" + 0.004*"college" + 0.003*"peopl

CPU times: user 1min 1s, sys: 17.6 s, total: 1min 19s
Wall time: 1min


In [None]:
# some additional helpful functions built into LdaModel

# how to store corpus to LOCAL disk (for Colab users, see below)
from gensim.corpora import MmCorpus
MmCorpus.serialize('./wheel.corpus.mm', wheel_corpus)

# how to store dictionary to disk
id2word_wheel.save('./wheel.dictionary')

# how to store model to disk 
lda_model.save('./lda_wheel-20topics_5iters.model')

INFO:gensim.corpora.mmcorpus:storing corpus in Matrix Market format to ./wheel.corpus.mm
INFO:gensim.matutils:saving sparse matrix to ./wheel.corpus.mm
INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:PROGRESS: saving document #1000
INFO:gensim.matutils:PROGRESS: saving document #2000
INFO:gensim.matutils:PROGRESS: saving document #3000
INFO:gensim.matutils:PROGRESS: saving document #4000
INFO:gensim.matutils:saved 4008x34847 matrix, density=0.667% (930947/139666776)
INFO:gensim.corpora.indexedcorpus:saving MmCorpus index to ./wheel.corpus.mm.index
INFO:gensim.utils:saving Dictionary object under ./wheel.dictionary, separately None
INFO:gensim.utils:saved ./wheel.dictionary
INFO:gensim.utils:saving LdaState object under ./lda_wheel-20topics_5iters.model.state, separately None
INFO:gensim.utils:saved ./lda_wheel-20topics_5iters.model.state
INFO:gensim.utils:saving LdaModel object under ./lda_wheel-20topics_5iters.model, separately ['expElogbeta', 'sstats']
INFO:gen

In [None]:
# how to store the above files to Google Drive 

from google.colab import drive
drive.mount('/content/gdrive')

from gensim.corpora import MmCorpus
# how to store corpus to Drive 
MmCorpus.serialize('/content/gdrive/My Drive/wheel.corpus.mm', wheel_corpus)

# how to store dictionary to Drive
id2word_wheel.save('/content/gdrive/My Drive/wheel.dictionary')

# how to store model to Drive 
lda_model.save('/content/gdrive/My Drive/lda_wheel-20topics_5iters.model')

INFO:gensim.corpora.mmcorpus:storing corpus in Matrix Market format to /content/gdrive/My Drive/wheel.corpus.mm
INFO:gensim.matutils:saving sparse matrix to /content/gdrive/My Drive/wheel.corpus.mm
INFO:gensim.matutils:PROGRESS: saving document #0


Mounted at /content/gdrive


INFO:gensim.matutils:PROGRESS: saving document #1000
INFO:gensim.matutils:PROGRESS: saving document #2000
INFO:gensim.matutils:PROGRESS: saving document #3000
INFO:gensim.matutils:PROGRESS: saving document #4000
INFO:gensim.matutils:saved 4008x34847 matrix, density=0.667% (930947/139666776)
INFO:gensim.corpora.indexedcorpus:saving MmCorpus index to /content/gdrive/My Drive/wheel.corpus.mm.index
INFO:gensim.utils:saving Dictionary object under /content/gdrive/My Drive/wheel.dictionary, separately None
INFO:gensim.utils:saved /content/gdrive/My Drive/wheel.dictionary
INFO:gensim.utils:saving LdaState object under /content/gdrive/My Drive/lda_wheel-20topics_5iters.model.state, separately None
INFO:gensim.utils:saved /content/gdrive/My Drive/lda_wheel-20topics_5iters.model.state
INFO:gensim.utils:saving LdaModel object under /content/gdrive/My Drive/lda_wheel-20topics_5iters.model, separately ['expElogbeta', 'sstats']
INFO:gensim.utils:storing np array 'expElogbeta' to /content/gdrive/My D

You can also load in a saved model. This is very helpful to know about, since generating new topic models takes time. 

Here, we're going to load in a (slightly) better topic model of the Emory Wheel with the same number of topics (20), but trained with 50 passes over the corpus instead of 5 (the file names call these "iters").

In [None]:
# how you would load an old model from your own google drive
# (use the same path you saved it to above)
# lda_model = gensim.models.LdaModel.load('/content/gdrive/My Drive/lda_wheel-20topics_5iters.model')

# how we are all going to load in the same model -- from my google drive 
gdown.download('https://drive.google.com/uc?export=download&id=1cUQR3tGyKb1uGbDz3oOIRYyWHXsDoO_l', quiet=False) 

# unzip it 
!unzip lda_wheel-20topics_50iters.zip

# load it 
lda_model = gensim.models.LdaModel.load('lda_wheel-20topics_50iters.model')

Downloading...
From: https://drive.google.com/uc?export=download&id=1cUQR3tGyKb1uGbDz3oOIRYyWHXsDoO_l
To: /content/lda_wheel-20topics_50iters.zip
100%|██████████| 4.00M/4.00M [00:00<00:00, 97.4MB/s]

Archive:  lda_wheel-20topics_50iters.zip
  inflating: lda_wheel-20topics_50iters.model  
  inflating: __MACOSX/._lda_wheel-20topics_50iters.model  
  inflating: lda_wheel-20topics_50iters.model.expElogbeta.npy  
  inflating: __MACOSX/._lda_wheel-20topics_50iters.model.expElogbeta.npy  
  inflating: lda_wheel-20topics_50iters.model.id2word  
  inflating: __MACOSX/._lda_wheel-20topics_50iters.model.id2word  
  inflating: lda_wheel-20topics_50iters.model.state  
  inflating: __MACOSX/._lda_wheel-20topics_50iters.model.state  



INFO:gensim.utils:loading LdaModel object from lda_wheel-20topics_50iters.model
INFO:gensim.utils:loading expElogbeta from lda_wheel-20topics_50iters.model.expElogbeta.npy with mmap=None
INFO:gensim.utils:setting ignored attribute dispatcher to None
INFO:gensim.utils:setting ignored attribute state to None
INFO:gensim.utils:setting ignored attribute id2word to None
INFO:gensim.utils:loaded lda_wheel-20topics_50iters.model
INFO:gensim.utils:loading LdaState object from lda_wheel-20topics_50iters.model.state
INFO:gensim.utils:loaded lda_wheel-20topics_50iters.model.state


In [None]:
# gensim comes with a bunch of functions that make interacting with the output of the topic
# model a little easier. this one shows the topics. 

# show the topics, in the format (number of topics to show, number of terms)
# note that all words are in all topics, just some topics consist of very very small
# proportions of that word

# as you can tell already, even the top words in each topic are only a very small proportion
# of that topic, since we are dealing with about 14K unique words

lda_model.show_topics(15, 20)

[(5,
  '0.041*"instant" + 0.041*"steak" + 0.022*"banned" + 0.021*"curve" + 0.021*"messenger" + 0.021*"manners" + 0.021*"tequila" + 0.021*"buti" + 0.020*"adderall" + 0.020*"atmaggie" + 0.020*"becauseeveryone" + 0.020*"cushy" + 0.011*"correspondent" + 0.007*"sports" + 0.006*"drinking" + 0.003*"slap" + 0.003*"ass" + 0.002*"caesar" + 0.002*"ruderman" + 0.002*"plant"'),
 (6,
  '0.019*"food" + 0.008*"coffee" + 0.006*"dining" + 0.006*"like" + 0.005*"duc" + 0.005*"restaurant" + 0.005*"meal" + 0.004*"market" + 0.004*"options" + 0.004*"emory" + 0.004*"menu" + 0.003*"cox" + 0.003*"store" + 0.003*"chicken" + 0.003*"kaldi" + 0.003*"offers" + 0.003*"small" + 0.003*"eat" + 0.003*"friend" + 0.003*"atlanta"'),
 (12,
  '0.025*"emory" + 0.024*"riders" + 0.022*"horse" + 0.018*"riding" + 0.016*"said" + 0.016*"horses" + 0.014*"team" + 0.014*"equestrian" + 0.013*"members" + 0.013*"like" + 0.010*"lee" + 0.010*"body" + 0.010*"chastain" + 0.010*"oquindo" + 0.009*"caption" + 0.008*"people" + 0.008*"community" + 

In [None]:
# let's format the words a little more nicely; 
# the formatted=False parameter returns tuples of (word, probability)

topics = lda_model.show_topics(20, 20, formatted=False)

for topic in topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

T0: game, games, players, marvel, new, series, player, man, world, batman, black, characters, weapons, battle, duty, superman, map, universe, character, superhero, 
T1: ono, ancient, nsa, garland, patterson, blanco, thomson, mediterranean, farley, ap, goodwin, blm, levinson, ib, roth, ramanujan, dna, credit, noname, tri, 
T2: said, positioning, trot, saddle, carter, library, trainings, footwork, utilities, collisions, sathletic, poetry, museum, book, writing, center, collection, american, rose, history, 
T3: mice, sincerely, doolino, love, people, ve, probably, society, funny, today, spoke, sex, jokes, better, hope, like, best, wrong, won, laugh, 
T4: atlanta, city, new, grooming, united, world, million, air, york, line, china, patriots, marta, space, light, annexation, bowl, project, road, county, 
T5: instant, steak, banned, curve, messenger, manners, tequila, buti, adderall, atmaggie, becauseeveryone, cushy, correspondent, sports, drinking, slap, ass, caesar, ruderman, plant, 
T6: f

## Topics and Labels

Now you can see, perhaps, that what we call a "topic" is really just a list of the most probable words for that topic, which are sorted in descending order of probability. The most probable word for the topic is the first word. 

Topic models start to get more powerful when we, as human researchers, analyze the most probable words for every topic and summarize what these words have in common. This summary can then be used as a descriptive label for the topic. Remember, since an LDA topic model is an unsupervised algorithm, it doesn't know what these words mean in relationship to one another. It's up to us, as the human researchers, to make meaning out of the topics.

How might you label the following topics?

✨Topic 6:✨

`food, coffee, dining, like, duc, restaurant, meal, market, options, emory, menu, cox, store, chicken, kaldi, offers, small, eat, friend, atlanta`

In [None]:
# Food and Eating

✨Topic 7:✨

`T7: emory, said, students, university, according, college, student, campus, year, school, president, community, wheel, new, program, work, faculty, health, members, wrote`

In [None]:
# College Life

✨Topic 17:✨

`T17: flu, vaccination, people, season, cdc, years, like, vaccine, virus, children, risk, reported, severe, deaths, received, early, according, contracting, medical, treat`



In [None]:
# Health and Well-being
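
One way to keep track of the labels you settle on is a plain dictionary keyed by topic number. The sketch below is not part of the original notebook: `topic_labels`, `top_words`, and `describe_topic` are hypothetical names, and the word lists are shortened copies of the topics shown above.

```python
# Hypothetical bookkeeping: human-written labels keyed by topic number.
# The labels are our interpretation; the model itself knows nothing about them.
topic_labels = {
    6: "Food and Eating",
    7: "College Life",
    17: "Health and Well-being",
}

# Shortened copies of the top words printed above (first five per topic).
top_words = {
    6: ["food", "coffee", "dining", "like", "duc"],
    7: ["emory", "said", "students", "university", "according"],
    17: ["flu", "vaccination", "people", "season", "cdc"],
}


def describe_topic(topic_num, labels, words):
    """Return a one-line summary, using 'unlabeled' for topics without a label."""
    label = labels.get(topic_num, "unlabeled")
    return "T" + str(topic_num) + " (" + label + "): " + ", ".join(words[topic_num])


for topic_num in sorted(top_words):
    print(describe_topic(topic_num, topic_labels, top_words))
```

Keeping the labels separate from the model makes it easy to revise them as your reading of a topic changes.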

## Refining the topic model

These are decent topics, but they're not amazing. Here are a few things you might want to try in order to fine-tune your model:

* Filtering some of the most common words (see the filtering function above)
* Generating more or fewer topics (we could try 10, for instance). 

Feel free to try those things on your own. 
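
As a starting point for the first suggestion, here is one way to drop words that appear in a large share of documents. This is a plain-Python sketch, not the filtering function used earlier in the lesson; `filter_common_words`, the `max_doc_share` cutoff, and the toy documents are all hypothetical.

```python
from collections import Counter


def filter_common_words(tokenized_docs, max_doc_share=0.5):
    """Remove tokens that occur in more than max_doc_share of the documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each word once per document

    cutoff = max_doc_share * len(tokenized_docs)
    too_common = {word for word, count in doc_freq.items() if count > cutoff}
    return [[tok for tok in doc if tok not in too_common] for doc in tokenized_docs]


# Toy documents: "emory" and "said" appear in all three, so both are dropped.
toy_docs = [
    ["emory", "said", "food", "coffee"],
    ["emory", "said", "game", "players"],
    ["emory", "flu", "vaccine", "said"],
]
print(filter_common_words(toy_docs))
```

If you are working with a gensim `Dictionary`, its `filter_extremes` method (with the `no_below` and `no_above` parameters) serves a similar purpose and is probably the more convenient route.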

But for the purposes of this class, let's take a bit of a closer look at the probabilities attached to each word in a single topic. 

In [None]:
# T6 looks coherent
topic = topics[6]

# the first item of the topic is the topic number
topic_num = topic[0]

# the next item is another list with pairs of words and probabilities
topic_pairs = topic[1]
for pair in topic_pairs:
    print(pair[0] + ": " + str(pair[1]))

# every topic assigns a probability to every word in the vocabulary, so a topic's
# probabilities sum to 1 across the whole vocabulary (not just the 20 words printed here)


food: 0.018672202
coffee: 0.007631046
dining: 0.005781589
like: 0.0055155074
duc: 0.0051085055
restaurant: 0.0049236505
meal: 0.004581294
market: 0.0043532657
options: 0.0040271166
emory: 0.0038891642
menu: 0.0038616052
cox: 0.003395483
store: 0.003207978
chicken: 0.0031602818
kaldi: 0.0030526472
offers: 0.0030273078
small: 0.0030035032
eat: 0.0029921897
friend: 0.0029554209
atlanta: 0.0029536472
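
The 20 probabilities printed above are only the largest entries of a distribution over the whole vocabulary, so on their own they add up to much less than 1. The quick check below simply re-types the printed values (no model needed) and sums them; `top_word_probs` is a name chosen for this illustration.

```python
# Values copied from the output above for topic 6.
top_word_probs = {
    "food": 0.018672202, "coffee": 0.007631046, "dining": 0.005781589,
    "like": 0.0055155074, "duc": 0.0051085055, "restaurant": 0.0049236505,
    "meal": 0.004581294, "market": 0.0043532657, "options": 0.0040271166,
    "emory": 0.0038891642, "menu": 0.0038616052, "cox": 0.003395483,
    "store": 0.003207978, "chicken": 0.0031602818, "kaldi": 0.0030526472,
    "offers": 0.0030273078, "small": 0.0030035032, "eat": 0.0029921897,
    "friend": 0.0029554209, "atlanta": 0.0029536472,
}

# Probability mass carried by the 20 most probable words.
top_mass = sum(top_word_probs.values())

print("Top 20 words cover {:.1%} of topic 6".format(top_mass))
print("The rest of the vocabulary covers {:.1%}".format(1 - top_mass))
```

The top 20 words account for a little under a tenth of the topic; the remaining mass is spread thinly across every other word in the dictionary.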


Let's flip it around and look at the topical composition of a single document. 

NB: MALLET provides this output automatically, but with gensim there's a bit more work required.

In [None]:
tokens = [] 

# open one file
with open('wheel-clean/2014-10-02-Atlanta-Food-Truck-Park-Brings-Enriching-Epicurian-Experience.txt', "r") as file:
    text = file.read()
    tokens = tokenize(text) # remember this from above

# create the bag of words for the document on the basis of the Wheel dictionary, created above
doc_bow = id2word_wheel.doc2bow(tokens)

# get the topics that the doc consists of
doc_topics = lda_model.get_document_topics(doc_bow)

doc_topics
    



[(0, 0.017236043),
 (2, 0.015564269),
 (3, 0.029299729),
 (4, 0.04100305),
 (6, 0.01669641),
 (7, 0.11078718),
 (8, 0.064258404),
 (9, 0.12697448),
 (11, 0.10035316),
 (12, 0.06897044),
 (13, 0.020446984),
 (14, 0.030639485),
 (15, 0.04768126),
 (16, 0.13759154),
 (19, 0.15188634)]
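
Notice that some topic numbers are absent from this list. As far as we know, gensim leaves out topics whose share falls below the model's `minimum_probability` setting, which would also explain why the shares add up to slightly less than 1. The sketch below re-types the printed pairs and ranks them. It assumes the model has 20 topics (based on the `show_topics` call earlier), and `printed_doc_topics` is an illustrative name.

```python
# Pairs copied from the output above: (topic number, share of the document).
printed_doc_topics = [
    (0, 0.017236043), (2, 0.015564269), (3, 0.029299729), (4, 0.04100305),
    (6, 0.01669641), (7, 0.11078718), (8, 0.064258404), (9, 0.12697448),
    (11, 0.10035316), (12, 0.06897044), (13, 0.020446984), (14, 0.030639485),
    (15, 0.04768126), (16, 0.13759154), (19, 0.15188634),
]
num_topics = 20  # assumption: the model was trained with 20 topics

# Rank topics from the largest share of the document to the smallest.
ranked = sorted(printed_doc_topics, key=lambda pair: pair[1], reverse=True)
for topic_num, prob in ranked[:3]:
    print("T" + str(topic_num) + ": " + "{:.2%}".format(prob))

# Topics that do not appear in the output at all.
present = {topic_num for topic_num, _ in printed_doc_topics}
missing = [t for t in range(num_topics) if t not in present]
total_share = sum(prob for _, prob in printed_doc_topics)

print("Topics not reported:", missing)
print("Share accounted for: {:.1%}".format(total_share))
```

Sorting this way makes it easier to see which topics dominate a document before you cross-reference their top words.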

In [None]:
# now we can cross-reference to find those topics and words

for topic, prob in doc_topics:
    print("T" + str(topic) + ": " + "{:.2%}".format(prob) + " of document.")

    topic_words = "Top words in topic: "
    select_topics = topics[topic]
    
    for pair in select_topics[1]:
        topic_words += pair[0] + ", "
    
    print(topic_words)
 

T0: 1.72% of document.
Top words in topic: game, games, players, marvel, new, series, player, man, world, batman, black, characters, weapons, battle, duty, superman, map, universe, character, superhero, 
T2: 1.56% of document.
Top words in topic: said, positioning, trot, saddle, carter, library, trainings, footwork, utilities, collisions, sathletic, poetry, museum, book, writing, center, collection, american, rose, history, 
T3: 2.93% of document.
Top words in topic: mice, sincerely, doolino, love, people, ve, probably, society, funny, today, spoke, sex, jokes, better, hope, like, best, wrong, won, laugh, 
T4: 4.10% of document.
Top words in topic: atlanta, city, new, grooming, united, world, million, air, york, line, china, patriots, marta, space, light, annexation, bowl, project, road, county, 
T6: 1.67% of document.
Top words in topic: food, coffee, dining, like, duc, restaurant, meal, market, options, emory, menu, cox, store, chicken, kaldi, offers, small, eat, friend, atlanta, 
T7

### Evaluating Topics ###

Gensim includes several built-in measures for evaluating topics, collected in a class called [CoherenceModel](https://radimrehurek.com/gensim/models/coherencemodel.html). The fastest one to calculate is called `u_mass`; for this measure, the closer the score is to zero (whether positive or negative), the better.

Let's see how our model performs: 

In [None]:
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lda_model, corpus=wheel_corpus, coherence='u_mass')

coherence = cm.get_coherence()  # get coherence value

coherence

INFO:gensim.topic_coherence.text_analysis:CorpusAccumulator accumulated stats from 1000 documents
INFO:gensim.topic_coherence.text_analysis:CorpusAccumulator accumulated stats from 2000 documents
INFO:gensim.topic_coherence.text_analysis:CorpusAccumulator accumulated stats from 3000 documents
INFO:gensim.topic_coherence.text_analysis:CorpusAccumulator accumulated stats from 4000 documents


-11.785140866507527
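
To get a feel for what a score like this measures, here is a small hand-rolled version of UMass-style coherence on a toy corpus. It uses the commonly cited form of the measure: for each pair of topic words, take the log of (number of documents containing both words, plus 1) divided by the number of documents containing the higher-ranked word, then sum over pairs. This is a simplified sketch, not gensim's implementation, so it will not reproduce the number above. The function name `umass_coherence` and the toy documents are made up for illustration, and the code assumes every topic word appears in at least one document.

```python
import math


def umass_coherence(topic_words, documents):
    """Sum log((D(w_m, w_l) + 1) / D(w_l)) over pairs where w_l is ranked above w_m."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # number of documents that contain every one of the given words
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            both = doc_count(topic_words[m], topic_words[l])
            score += math.log((both + 1) / doc_count(topic_words[l]))
    return score


toy_docs = [
    ["food", "coffee", "menu"],
    ["food", "coffee", "dining"],
    ["game", "players", "map"],
    ["game", "players", "coffee"],
]

together = umass_coherence(["food", "coffee"], toy_docs)  # words that share documents
apart = umass_coherence(["food", "game"], toy_docs)       # words that never co-occur
print(together, apart)
```

The pair that shares documents scores closer to zero than the pair that never appears together, which is the intuition behind reading coherence scores.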

Here's a review essay by Hanna Wallach et al. that summarizes a few methods of evaluation, including some involving humans in the loop: ["Evaluation Methods for Topic Models"](http://dirichlet.net/pdf/wallach09evaluation.pdf).

Another way to evaluate topics is just to look at them.

The pyLDAvis library lets you do this in a single line. It's very satisfying! 

In [None]:
# install the version of pyLDAvis that works for Colab (note, not the most recent one)
!pip install pyLDAvis==2.1.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyLDAvis==2.1.2
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97738 sha256=8db558823117f6ad907fd6ce118602bedc5fe5b52336fee336dfe2226c1727ef
  Stored in directory: /root/.cache/pip/wheels/3b/fb/41/e32e5312da9f440d34c4eff0d2207b46dc9332a7b931ef1e89
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.17 pyLDAvis-2.1.2


In [None]:
# LDA visualization tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this

# just reformat the corpus for pyLDAvis 
from gensim.corpora import MmCorpus
wheel_mm_corpus = MmCorpus('./wheel.corpus.mm')

pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(lda_model, wheel_mm_corpus, id2word_wheel)

INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./wheel.corpus.mm.index
INFO:gensim.corpora._mmreader:initializing cython corpus reader from ./wheel.corpus.mm
INFO:gensim.corpora._mmreader:accepted corpus with 4008 documents, 34847 features, 930947 non-zero entries


*Lauren Klein wrote this lesson in 2019 drawing on writing by [Ted Underwood](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) and [Matthew Jockers](http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/), this [video of a talk by David Mimno](https://vimeo.com/53080123), and [this notebook](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html) by Radim Rehurek. It was supplemented in 2020 with additional materials from Dan Sinykin, and revised again in 2021 and 2022 by Lauren Klein.* 