# Topic modeling

Having written [a lot of prose to explain topic modeling](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) elsewhere, I won't repeat myself at length here.

Suffice it to say that this notebook demonstrates an implementation of LDA in Python, using the ```gensim``` module.

Topic modeling is an area where sheer compute power starts to matter more than it has in most of our other work, and I don't think ```gensim``` is necessarily the fastest implementation. If you wanted to apply topic modeling to a large corpus, it might be worthwhile figuring out how to use ```gensim``` in a "distributed" way, or exploring another implementation, such as [```MALLET```](http://mallet.cs.umass.edu). MALLET is the most commonly used implementation in digital humanities, and there's [a good Programming Historian tutorial](http://programminghistorian.org/lessons/topic-modeling-and-mallet). However, MALLET requires Java, and I wanted to limit the number of installation problems we confront.


In [13]:
import gensim
import os, math
import pandas as pd
import nltk

# nltk.download('stopwords')
# nltk.download('punkt')
# You may not have the stopwords downloaded yet.
# You can comment this out after it runs once.


### Load a corpus

I've provided three corpora: ```tinywikicorpus.csv```, ```smallwikicorpus.csv```, and ```mediumwikicorpus.csv```.

This stuff gets compute-intensive pretty fast, so let's start with the small one. This has 250 Wikipedia pages, each on a separate line of the file -- and only the first 250 words of each page. The tiny corpus has the first 160 words of each of 160 pages; the medium corpus has the first 400 words of each of 400 pages.

Obviously, this is not a huge corpus! But in real-life applications, you have to distribute topic modeling over multiple cores, and even then it's common to wait several hours for a result. That doesn't adapt very well to a classroom experiment.

In [70]:
# Very simply, reading the corpus from a CSV file.
# Each row's "text" column holds one document.
# (The commented-out wiki corpora have one page per line instead.)

# relativepath = os.path.join('..', 'data', 'smallwikicorpus.txt')
# relativepath = os.path.join('..', 'data', 'mediumwikicorpus.txt')
relativepath = os.path.join('..', 'data', 'poefic.csv')
poefic = pd.read_csv(relativepath)
fictioncorpus = [' '.join(x.split()) for x in poefic.text] # or slice it more if you want, or don't

# print(fictioncorpus[0])
print(len(fictioncorpus))

# wikicorpus = []
# with open(relativepath, encoding = 'utf-8') as f:
#     for line in f:
#         wikicorpus.append(line.strip())
        
# print(wikicorpus)

1027
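
If you do want to slice each document down to its first few hundred words (the way the Wikipedia corpora described above were trimmed), something like the following would work. This is a minimal sketch: the helper name ```first_n_words``` and the sample strings are made up for illustration and are not part of the notebook's data.

```python
def first_n_words(text, n=250):
    """Return the first n whitespace-separated words of text,
    rejoined with single spaces (this also normalizes stray newlines and tabs)."""
    return ' '.join(text.split()[:n])

# Hypothetical stand-ins for the documents in a real corpus.
sample_docs = [
    "The  quick brown\nfox jumps over the lazy dog",
    "Once upon a midnight dreary",
]

truncated = [first_n_words(doc, n=4) for doc in sample_docs]
print(truncated)
```

Shorter documents mean faster modeling, at the cost of giving the model less evidence about each text.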


### Prepare the corpus for topic modeling

In part this is a simple tokenizing job. We have represented each document as a single string; gensim is going to expect each document to be a *list* of words. So we need to split the document into words.

But in the process of doing that, we also want to get rid of extremely common words, which make a topic model difficult to read and interpret.

To do this, we create a list of "stopwords." We also remove punctuation, and lowercase everything.

In [60]:
from nltk.corpus import stopwords

# We're going to borrow a list of stopwords from nltk.

# This list of "stopwords" removed from the corpus is not
# a trivial, generic decision; your choice of stopwords can
# in practice significantly affect the result. Here's a place where
# the open-ended character of an unsupervised learning algorithm
# becomes tricky.

# stopwords = {'a', 'an', 'the', 'of', 'and', 'in', 'to', 'by', 'on', 'for', 'it', 'at', 'me', 'from', 'with', '.', ','}
# in case you can't access nltk

from nltk.tokenize import word_tokenize
import string

stopped = set(stopwords.words('english'))
punctuation = set(string.punctuation)
stopped = stopped.union(punctuation)

more_stops = {"paul", "john", "jack", "\'s", "nt",
              "``", "\'the", ";", '“', 'pb', "mary", 
              "henry", "arthur", "polly", "alice", 
              "jane", "jean", "michael", "harold",
             "tom", "richard"}
# When you're topic-modeling fiction, personal names
# present a special problem.

stopped = stopped.union(more_stops)
punctuation.add('“')
punctuation.add('”')
punctuation.add('—')

def strip_punctuation(atoken):
    # Drop every punctuation character from a single token.
    return ''.join(ch for ch in atoken if ch not in punctuation)

def clean_text(atext):
    # Tokenize and strip punctuation from each token, then re-tokenize
    # the result and drop stopwords. Tokens that were pure punctuation
    # become empty strings and disappear in the second tokenizing pass.
    clean_version = [strip_punctuation(x) for x in word_tokenize(atext.lower())]
    rejoined = ' '.join(clean_version)
    tokenized = [x for x in word_tokenize(rejoined) if x not in stopped]
    return tokenized

clean_fictioncorpus = []
for atext in fictioncorpus:
    clean_version = clean_text(atext)
    if len(clean_version) > 1:
        clean_fictioncorpus.append(clean_version)
    
print("The clean_corpus contains " + str(len(clean_fictioncorpus)) + " texts.")
print(type(fictioncorpus))
# print(clean_fictioncorpus)

The clean_corpus contains 1027 texts.
<class 'list'>
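
In case you can't access nltk at all, here is a rough fallback that uses only the standard library. It is a sketch rather than a drop-in replacement: the function name ```simple_clean_text``` is hypothetical, the stopword list is the tiny one from the commented-out line above, and splitting on whitespace will not treat contractions the way ```word_tokenize``` does.

```python
import string

# The tiny fallback stopword list from the comment above
# (punctuation entries omitted, since punctuation is handled separately).
basic_stops = {'a', 'an', 'the', 'of', 'and', 'in', 'to', 'by', 'on',
               'for', 'it', 'at', 'me', 'from', 'with'}

# ASCII punctuation plus curly quotes (\u201c, \u201d) and the long dash (\u2014).
punct_chars = set(string.punctuation) | {'\u201c', '\u201d', '\u2014'}

def simple_clean_text(atext, stops=basic_stops):
    """Lowercase, turn punctuation into spaces, split on whitespace,
    and drop stopwords. Returns a list of tokens."""
    no_punct = ''.join(' ' if ch in punct_chars else ch for ch in atext.lower())
    return [tok for tok in no_punct.split() if tok not in stops]

sample = "\u201cThe ship,\u201d said the Captain\u2014\u201csails at dawn.\u201d"
print(simple_clean_text(sample))
```

Replacing punctuation with a space (instead of deleting it) keeps words on either side of a dash from fusing together, which is a design choice you may or may not want.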


### Build a dictionary and create the doc-term matrix

The math inside ```gensim``` runs quicker if we know, at the outset, how many words we're dealing with, and represent each word as an integer. So the first stage in building a model is to build a dictionary, which stores words as the values of integer keys.

In [62]:
from gensim import corpora

dictionary = corpora.Dictionary(clean_fictioncorpus)
dictionary.filter_extremes(no_below = 4, no_above = 0.11)

# The filter_extremes method allows us to remove words from the dictionary.
# In this case we remove words that occur in fewer than 4 documents, or more
# than 11% of the documents in the corpus. This is, in effect, another
# form of stopwording.

# If you had a much larger corpus, you might increase no_below to 10 or 20.

print('Dictionary made.')
print(len(dictionary), "words.")
print(len(clean_fictioncorpus), "documents.")
doc_term_matrix = [dictionary.doc2bow(doc) for doc in clean_fictioncorpus if len(doc) > 1]
print('Doc-term matrix extracted.')


Dictionary made.
13178 words.
1027 documents.
Doc-term matrix extracted.
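
To make the dictionary step less mysterious, here is a pure-Python sketch of roughly what happens: count how many documents each word appears in, keep the words inside the ```no_below``` / ```no_above``` bounds, give each survivor an integer ID, and turn every document into (ID, count) pairs. The toy corpus and the helper names ```build_vocab``` and ```to_bow``` are invented for illustration; gensim's real ```Dictionary``` assigns IDs differently and has more options.

```python
from collections import Counter

# A made-up corpus of already-tokenized documents.
toy_corpus = [
    ['ship', 'captain', 'sail', 'sea'],
    ['ship', 'deck', 'ship', 'sea'],
    ['poem', 'autumn', 'sea'],
    ['poem', 'tomb', 'ship', 'sea'],
]

def build_vocab(docs, no_below=2, no_above=0.8):
    """Map each kept word to an integer ID. A word is kept if it appears in
    at least no_below documents and in no more than no_above (a fraction)
    of all documents."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(w for w, df in doc_freq.items() if no_below <= df <= max_docs)
    return {word: idx for idx, word in enumerate(kept)}

def to_bow(doc, word2id):
    """Convert a token list into sorted (word_id, count) tuples,
    silently skipping words that were filtered out of the vocabulary."""
    counts = Counter(word2id[w] for w in doc if w in word2id)
    return sorted(counts.items())

word2id = build_vocab(toy_corpus)
bows = [to_bow(doc, word2id) for doc in toy_corpus]
print(word2id)
print(bows)
```

Here "sea" is dropped for being in every document, and the words that appear in only one document are dropped for being too rare, which is the sense in which this filtering acts as another form of stopwording.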


In [63]:
# Just to show you what's in the dictionary.

print(dictionary[1069])
print(dictionary[880])

bay
carthage


In [64]:
# And what our corpus looks like now.
# Each tuple contains a word ID, and the number of occurrences of that word.

print(doc_term_matrix[4])

[(36, 1), (91, 1), (165, 1), (202, 1), (215, 1), (220, 1), (253, 1), (283, 1), (319, 1), (398, 3), (436, 1), (541, 1), (568, 1), (665, 1), (699, 2), (719, 1), (828, 1), (851, 1), (868, 1), (872, 1), (958, 1), (1017, 1), (1066, 3), (1074, 1), (1144, 1), (1183, 1), (1199, 1), (1248, 1), (1256, 1), (1268, 1), (1457, 1), (1536, 1), (1554, 1), (1572, 2), (1619, 1), (1643, 1), (1674, 1), (1709, 1), (1768, 1), (1856, 1), (1902, 1), (1925, 1), (2130, 1), (2261, 1), (2334, 2), (2425, 1), (2450, 1), (2486, 1), (2491, 1), (2575, 1), (2629, 1), (2708, 2), (2759, 1), (2789, 1), (2809, 1), (2832, 1), (3074, 1), (3079, 1), (3228, 1), (3273, 1), (3316, 1), (3370, 1), (3375, 1), (3389, 1), (3429, 1), (3464, 1), (3523, 1), (3670, 1), (3684, 1), (3691, 1), (3740, 1), (3756, 1), (3802, 1), (3900, 1), (3940, 2), (4068, 1), (4179, 1), (4275, 1), (4288, 1), (4389, 1), (4415, 1), (4518, 1), (4575, 1), (4617, 1), (4678, 1), (4711, 1), (4746, 1), (4762, 1), (4769, 1), (4779, 1), (4845, 1), (4879, 1), (4955, 1),

### Actually running LDA

The first line here just gives gensim's LDA model class a shorter name.
The second line uses that class to train a model of our corpus.

```num_topics``` and ```passes``` are both parameters you may want to fiddle with. Sixteen topics is a pretty small number. In a larger corpus that would be increased. For our medium corpus, you might try 20 or 25. As with clustering, there are strategies that can attempt to optimize the "right" number, but this is in reality a matter of judgement.

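```passes``` sets the number of full passes that training makes over the corpus. (gensim also has a separate ```iterations``` parameter, which limits the inference steps taken on each document.) More is better, up to a thousand or so. But for a classroom experiment, we probably don't want to go over 200.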

In [65]:
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics = 16, id2word = dictionary, passes = 50)

In [66]:
def pretty_print_topics(topiclist):
    for topicnum, topic in topiclist:
        cleanwords = []
        pieces = topic.split(' + ')
        for p in pieces:
            numword = p.split('*')
            word = numword[1].strip('"')
            cleanwords.append(word)
        print(topicnum, ' '.join(cleanwords))

pretty_print_topics(ldamodel.print_topics(num_topics=16, num_words=10))

0 dreaming lyrics willie luther sins joys hell freedom pleasures lays
1 de st paris banks england interest des robert page x
2 poems farewell autumn tomb ray lake lone joys afar wing
3 four israel sword ho chief paradise tonight aaron david r
4 hell law freedom hate eleanor nation falls southern evil lion
5 boat ship captain sail lieutenant th bee deck shop buck
6 says jerusalem doom christ july food ann iv awful flesh
7 3 2 jesus clay 4 says joan witness lines 5
8 modern boys cave general ar st ball says send milly
9 letter table suppose ought idea case business manner certainly happened
10 jenny ring pine yellow thick crimson laughter mouth lying magic
11 wi hae sae nae frae aye ken laird weel wee
12 woods mirth tha keith sculpture identical sculptor fountain canto th
13 quot wold nay liberty canto duty margaret baby london poet
14 emma saxon company dog sal joan « says kitchen added
15 burns smith prince william presented palace c portrait published castle
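
The function above throws away the weights attached to each word. If you'd like to keep them, the topic strings can be parsed into (word, weight) pairs. This is a sketch under assumptions: the sample string is made up, though it follows the ```weight*"word" + ...``` shape that the splitting code above relies on, and ```parse_topic_string``` is a hypothetical helper name.

```python
import re

# Hypothetical topic string in the weight*"word" + weight*"word" format.
sample_topic = '0.031*"boat" + 0.024*"ship" + 0.019*"captain" + 0.011*"sail"'

def parse_topic_string(topic):
    """Return a list of (word, weight) pairs from a topic string,
    in the order they appear (highest weight first)."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic)
    return [(word, float(weight)) for weight, word in pairs]

for word, weight in parse_topic_string(sample_topic):
    print(f"{word:10s} {weight:.3f}")
```

Looking at the weights can tell you whether a topic is dominated by one or two words or spread more evenly across its top ten.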


Is that impressive? Probably not. The value of topic modeling depends heavily on the size of the corpus, and we are deliberately using small corpora to avoid frying your laptops.

If it ran quickly enough, you might try increasing ```passes``` to 200. See if those topics seem to make more sense. If *that* runs quickly enough, you might try loading ```mediumwikicorpus.csv```, to see if you get even more interpretable topics. But it will probably take 10-15 minutes to run, at a minimum.

### ANALYSIS
I decided to run the topic model on the entire poefic dataset by removing the slicing from the code we (mostly Ted) drafted during class. The topics here are slightly better, though they are not the cleanly separated set of ideas that I think is the researcher's dream when running topic modeling. Nonetheless, we have some interesting outputs. The poetry seems to have coalesced around a few topics, most evidently 13 and 11. Topic 13 shows signs of the distinctive diction of poetry, with "quot" and "wold" and "nay." Topic 11 also has atypical language that, I believe (from my years as an English undergraduate), resembles poetry in the Scots dialect (or some other British dialect), similar to Robert Burns.

We also have other topics that have grouped around some potential themes, such as topic 5, which contains a lot of nautical terms, or at least a few heavily nautical terms wrapped up with others that can be read that way. Similarly, topics 7, 0, and 4 could be, respectively, biblical or religious (the words and the numbers hint at this), about more existential texts, or historical fiction.

However, in the end, as I find with all topic modeling, it is only as useful as your research question. In this case, being relatively ignorant of the depth and breadth of the corpus, and without a research question, I don't find the topics incredibly useful. Personally, I think topic modeling is neat, and I like the functionality showcased, but even when I've had a research question, it has seemed incredibly shallow analytically. I think with a long list of stop words tailored to a research question, there could be something compelling in topic modeling. Similarly, using topic modeling to help identify types of texts seems pretty within reach (like poetry in this example). But without that, you are left with results that are sculptable--a good trait sometimes, but not so compelling as evidence of any in-depth analysis, in my opinion.

#### Other things you can do

One of the nice things about the gensim module is that it allows you to update an existing model; you can even add documents to the corpus and update the model.

In addition to getting the top words for a given topic (topic distribution across terms), you can get the distribution of a document across topics, or the distribution of a word across topics. For more on these options, see [the documentation.](https://radimrehurek.com/gensim/models/ldamodel.html)

In [10]:
ldamodel.update(doc_term_matrix, iterations = 50)

In [11]:
ldamodel.get_document_topics(doc_term_matrix[6])

[(1, 0.98600745228817488)]
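
The result is a list of (topic number, probability) pairs, and topics with very small probabilities are left out, which is why only one topic shows up here. If you want to label each document with its single strongest topic, a small helper will do it. This is a sketch: ```dominant_topic``` is a hypothetical name, and the example list is made up to show a document split across several topics.

```python
def dominant_topic(doc_topics):
    """Given a list of (topic_id, probability) pairs, return the pair
    with the highest probability, or None if the list is empty."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical output for a document spread across three topics.
example_doc_topics = [(1, 0.62), (5, 0.30), (11, 0.08)]

best = dominant_topic(example_doc_topics)
print(best)
```

Applied to every document in the corpus, this gives a crude way to sort texts into groups, for instance to see which documents the model treats as poetry.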

In [12]:
ldamodel.get_term_topics('rock')

[(3, 0.023670363990880825)]