## Running LDA

This notebook runs a topic model on (roughly) the 30,000 most common words in the **genremeta** corpus, as part of the research for "The Historical Significance of Textual Distance."

This was the first time I had used LDA in scikit-learn. I was guided by Aneesha Bakharia's post on ["Topic Modeling with Scikit-Learn,"](https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730) and borrowed some code snippets.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import csv
from collections import Counter
import numpy as np
import pandas as pd

#### Data preparation

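We begin by counting "document frequencies" for words in the corpus, that is, the number of documents each word appears in. Each row of the file read below holds a docid, a word, and a count, so (assuming each docid-word pair occupies only one row) counting rows per word gives its document frequency. Then we save a vocabulary of the 30,000 words with the highest document frequency.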

In [5]:
# count doc frequencies

vocabcount = Counter()

with open('../parsejsons/rawcounts.tsv', encoding = 'utf-8') as f:
    header = False
    for line in f:
        if not header:
            header = True
        else:
            row = line.strip().split('\t')
            word = row[1]
            vocabcount[word] += 1 
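
As a toy illustration of the difference between document frequency (what the cell above counts) and total term frequency, here is a minimal sketch. The rows are invented, and the three-column layout is an assumption based on how the file is parsed in this notebook.

```python
from collections import Counter

# Hypothetical rows in the assumed (docid, word, count) layout.
rows = [
    ("doc1", "sea", 4),
    ("doc1", "ship", 1),
    ("doc2", "sea", 2),
    ("doc3", "king", 7),
]

doc_freq = Counter()   # number of documents that contain each word
term_freq = Counter()  # total occurrences of each word across documents

for docid, word, count in rows:
    doc_freq[word] += 1       # one row per (docid, word) pair
    term_freq[word] += count

print(doc_freq.most_common(1))   # "sea" appears in the most documents
print(term_freq.most_common(1))  # "king" has the most total occurrences
```

Ranking by document frequency favors words that are widely distributed over words that are merely repeated often in a few volumes.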

In reality these words are already tokenized. But to take advantage of the sklearn tokenizer and avoid worrying about data format, I create pretend "documents" by repeating each word as many times as its count, and let the CountVectorizer recount them.

In [12]:
vocab = set([x[0] for x in vocabcount.most_common(30000)])

docs = []

with open('../parsejsons/rawcounts.tsv', encoding = 'utf-8') as f:
    header = False
    doc = []
    lastdoc = ''
    ctr = 0
    for line in f:
        if not header:
            header = True
            
        else:
            row = line.strip().split('\t')
            docid = row[0]
            word = row[1]
            count = int(row[2])
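            if lastdoc == '':
                lastdoc = docid
            elif docid != lastdoc:
                # docid changed: flush the finished document and start a new one
                docs.append(' '.join(doc))
                doc = []
                lastdoc = docid
                ctr += 1
            if word in vocab:
                # repeat the word `count` times so CountVectorizer can recount it
                doc.extend([word] * count)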

In [13]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=30000, stop_words='english')

In [14]:
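tf = tf_vectorizer.fit_transform(docs)
# get_feature_names() was removed in later scikit-learn releases;
# get_feature_names_out() replaces it and returns an array, hence .tolist()
tf_feature_names = tf_vectorizer.get_feature_names_out().tolist()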

In [15]:
tf_feature_names[0:10]

['10', '10th', '11', '12', '12th', '13th', '14th', '15th', '16th', '17th']

In [32]:
print(len(tf_feature_names))

28443


The actual length of the vocabulary, after discarding stop words and applying the `max_df` and `min_df` cutoffs, is 28,443 tokens.
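
To see why the vocabulary can shrink below the 30,000 words passed in, here is a small sketch of the same three filters on invented documents. The toy sentences are assumptions for illustration only; the parameters mirror the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy pseudo-documents (invented for illustration).
toy_docs = [
    "the sea ship sea",
    "the sea king",
    "the sea ship pineapple",
]

vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
vectorizer.fit(toy_docs)

# "the" is an English stop word, "sea" occurs in every document (above max_df),
# and "king" and "pineapple" occur in only one document (below min_df).
print(sorted(vectorizer.vocabulary_))
```

Only words that survive all three filters end up in the fitted vocabulary.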

### Performing LDA

This took 5-6 hours, even running multicore.

In [16]:
no_topics = 100
# n_topics was renamed n_components in later scikit-learn releases
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=50, learning_method='online', learning_offset=50., random_state=0).fit(tf)

In [18]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
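
The slice `topic.argsort()[:-no_top_words - 1:-1]` is compact but easy to misread: it walks backward from the end of the sorted indices, so it yields the indices of the largest weights in descending order. A minimal sketch with invented weights and words:

```python
import numpy as np

# Invented topic weights and vocabulary, for illustration only.
topic = np.array([0.1, 3.0, 0.5, 2.0])
feature_names = ["sea", "boat", "ship", "deck"]
n = 2

# argsort() orders indices from smallest to largest weight;
# the reversed slice takes the last n of them, largest first.
top = topic.argsort()[:-n - 1:-1]

# An equivalent, more explicit spelling: reverse, then take the first n.
top_explicit = topic.argsort()[::-1][:n]

print([feature_names[i] for i in top])
```

Both spellings select the same indices; the second may be easier to read at a glance.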

#### Display the topics

Relying heavily on Bakharia's code right here.

In [19]:
display_topics(lda, tf_feature_names, 10)

Topic 0:
ve sir mr doctor mrs inside ca brother maybe sort
Topic 1:
ve ca mr maybe inside city sir car miss street
Topic 2:
ve mr mrs maybe miss god sir ca car inside
Topic 3:
mr ve mrs sir street god inside maybe miss road
Topic 4:
ve mr maybe inside car sir book mrs ca pulled
Topic 5:
mr mrs miss street sir fellow sort ve shall lady
Topic 6:
ve ca mr god miss mrs inside car maybe brother
Topic 7:
san el la spanish del litde sal nat murphy fay
Topic 8:
ve mr mrs ca god maybe street wo inside sea
Topic 9:
grant beth miller junior daddy momma auntie laurel jenkins faith
Topic 10:
ve mr god ca mrs lot inside maybe car sir
Topic 11:
sea boat ship captain island deck shore beach fish land
Topic 12:
ve mr car mrs street ca maybe prince floor lot
Topic 13:
ve maybe ca mr pulled god king street doctor sea
Topic 14:
king prince court lord brother evans apartment pineapple yellow terrible
Topic 15:
ai em ye ve folks goin dat yer ca er
Topic 16:
finn ron sunny dixon pirates bliss conan pirate so

In [21]:
doc_topic_dist_unnormalized = np.matrix(lda.transform(tf))

In [22]:
doc_topic_dist = doc_topic_dist_unnormalized/doc_topic_dist_unnormalized.sum(axis=1)

In [23]:
doc_topic_dist.shape

(6845, 100)
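
The two cells above divide each row by its sum, so that every document's topic weights add up to 1. As a sketch of the same normalization on a plain NumPy array (an alternative to `np.matrix`, not what was run here), `keepdims=True` keeps the row sums as a column so the division broadcasts row by row. The numbers are invented.

```python
import numpy as np

# Invented unnormalized document-topic weights: 2 documents, 3 topics.
unnormalized = np.array([[2.0, 1.0, 1.0],
                         [0.5, 0.5, 1.0]])

# keepdims=True gives shape (2, 1), so each row is divided by its own sum.
row_sums = unnormalized.sum(axis=1, keepdims=True)
normalized = unnormalized / row_sums

print(normalized.sum(axis=1))  # each row should now sum to 1
```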

In [24]:
doc_topic_dist.tofile('doc_topic_dist.binary')
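
One caveat about `tofile`: it writes raw values only, with no record of shape or dtype, so whoever reads the file back has to supply both. A minimal round-trip sketch on a toy array (the file name and values are invented for illustration):

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the document-topic matrix.
dist = np.array([[0.5, 0.25, 0.25],
                 [0.25, 0.25, 0.5]])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_dist.binary")
    dist.tofile(path)
    # The dtype and shape must be known in advance; the file does not store them.
    restored = np.fromfile(path, dtype=dist.dtype).reshape(dist.shape)

print(np.array_equal(dist, restored))
```

For the matrix saved above, that would presumably mean reading 64-bit floats and reshaping to the (6845, 100) shape reported earlier.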

#### Pair docids with rows of the doc_topic_dist and write to file

First, let me count documents to make sure they line up.

In [25]:
docids = []
docset = set()

with open('../parsejsons/rawcounts.tsv', encoding = 'utf-8') as f:
    header = False
    for line in f:
        if not header:
            header = True
        else:
            row = line.strip().split('\t')
            docid = row[0]
            if docid not in docset:
                docids.append(docid)
                docset.add(docid)
print(len(docids))

6846


Ah. Now I realize that the code I used to read documents missed the last one! LDA takes too long to run to re-do this; I'm going to just cut the last doc, which turns out to be a single volume in randomA. We won't miss it.

In [26]:
docids[-1]

'uc1.b4357818'
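
The last document was missed because the reading loop only appends a document when the docid changes, and nothing flushes the final one after the loop ends. One way to avoid this, sketched here under the assumption that rows for the same docid are contiguous (as the original loop also assumes), is to group rows with `itertools.groupby`. The helper name `build_docs` and the toy rows are hypothetical, not part of the code that was actually run.

```python
from itertools import groupby

def build_docs(rows, vocab):
    """Group (docid, word, count) rows into one pseudo-document per docid.

    Assumes rows for the same docid are contiguous.
    """
    docids, docs = [], []
    for docid, group in groupby(rows, key=lambda row: row[0]):
        words = []
        for _, word, count in group:
            if word in vocab:
                # repeat the word `count` times, as in the original loop
                words.extend([word] * count)
        docids.append(docid)
        docs.append(" ".join(words))
    return docids, docs

# Invented rows for illustration.
rows = [
    ("doc1", "sea", 2),
    ("doc1", "ship", 1),
    ("doc2", "king", 1),
    ("doc2", "zzz", 5),  # not in vocab, so it is skipped
]
docids, docs = build_docs(rows, vocab={"sea", "ship", "king"})
print(docids)
print(docs)
```

Because `groupby` emits every group, including the final one, the docids and documents stay aligned and no separate counting pass is needed.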

In [30]:
docdist = pd.DataFrame(doc_topic_dist, index = docids[0: -1])
docdist.shape

(6845, 100)

In [31]:
docdist.to_csv('doc_topic_distribution.csv', index_label = 'docid')