# Topic Modeling

A Topic Modeling algorithm ends up with each topic consisting of a collection of terms in different proportions given by P(topic|term), and each document is a collection of topics in different proportions given by P(doc|topic). Topic modeling is an unsupervised algorithm. There are various algorithms available in gensim, as listed below:

* LSI (Latent Semantic Indexing)
* HDP (Hierarchical Dirichlet Process)
* LDA (Latent Dirichlet Allocation)
* Mallet wrapper

Of these, HDP can work out the optimum number of topics, so we use that here.

We use topic modeling as a dimensionality reduction technique, whereby each document moves from being a vector of tokens (vocabulary words and n-grams) to being a smaller, hopefully denser, vector of topics. Collection of document topic vectors is a matrix, which can be used to generate a document-document similarity matrix.

Topic Modeling code is adapted from this [gensim example notebook](https://markroxor.github.io/gensim/static/notebooks/gensim_news_classification.html), as well as the [HDP Documentation Page](https://radimrehurek.com/gensim/models/hdpmodel.html).

In [1]:
import ahocorasick
import gensim
import nltk
import numpy as np
import os
import pickle
import re
import spacy
import string

In [2]:
DATA_DIR = "../data"
MODEL_DIR = "../models"

CURATED_KEYWORDS = os.path.join(DATA_DIR, "raw_keywords.txt")
KEYWORD_MAPPING_FILES = [
    os.path.join(DATA_DIR, "keyword_neardup_mappings.tsv"),
    os.path.join(DATA_DIR, "keyword_dedupe_mappings.tsv")
]

CUSTOM_STOPWORDS = os.path.join(DATA_DIR, "stopwords.txt")

TEXTFILES_DIR = os.path.join(DATA_DIR, "textfiles")
TEXTFILES_PREPROC = os.path.join(DATA_DIR, "textfiles_preproc.txt")

HDP_MODEL = os.path.join(MODEL_DIR, "hdp_model.gensim")

TOPIC_SIMS = os.path.join(DATA_DIR, "topic_sims.npy")
TOPIC_LOOKUP = os.path.join(DATA_DIR, "topic_docid2corpus.pkl")

PAPERS_METADATA = os.path.join(DATA_DIR, "papers_metadata.tsv")

## Preprocessing

We apply the following three preprocessing steps to our data.
* Collocation Detection
* Lemmatization
* Stopword Removal

### Collocation Detection

We will load up a trie with our curated keywords, and then run through each text file replacing the multi-word token into a single one by replacing space chars with underscore. Keywords that refer to the same thing will be collapsed using the mappings.

In [3]:
keywords_dict = ahocorasick.Automaton()
fkeys = open(CURATED_KEYWORDS, "r")
for idx, keyword in enumerate(fkeys):
    keyword = keyword.strip()
    keywords_dict.add_word(keyword, (idx, keyword))
fkeys.close()
keywords_dict.make_automaton()

print("built dictionary trie with {:d} entries".format(len(keywords_dict)))

built dictionary trie with 2282 entries


In [4]:
canonical_keywords = {}
for keyword_mapping_file in KEYWORD_MAPPING_FILES:
    fmap = open(keyword_mapping_file, "r")
    for line in fmap:
        cols = line.strip().split("\t")
        canonical_keywords[cols[0]] = cols[1]
    fmap.close()
    
print("{:d} canonical keyword mappings".format(len(canonical_keywords)))

455 canonical keyword mappings


In [5]:
def find_and_replace(text, keywords_dict, canonical_keywords):
    # find
    matches = []
    for end_index, (insert_order, keyword) in keywords_dict.iter(text):
        start_index = end_index - len(keyword) + 1
        matches.append((start_index, end_index + 1, keyword))
    # replace
    text_chars = list(text)
    for start, end, source in matches:
        if source in canonical_keywords.keys():
            can_source = canonical_keywords[source]
            target = can_source.replace(" ", "_")
        else:
            target = source.replace(" ", "_")
        target = target.rjust(len(source))
        target_chars = list(target)
        j = 0
        for i in range(start, end):
            text_chars[i] = target_chars[j]
            j += 1
    # return
    return re.sub("\s+", " ", "".join(text_chars))


text = """Based on these results it is possible to compare the MEM to other families 
of models (e.g., neural networks and state dependent models). It is shown that a degenerate 
version of the MEM is in fact equivalent to a neural network, and the number of experts 
in the architecture plays a similar role to the number of hidden units in
the latter model."""
text = text.replace("\n", " ")
text = re.sub("\s+", " ", text)
print(text)
print("---")
text_with_collocs = find_and_replace(text, keywords_dict, canonical_keywords)
print(text_with_collocs)

Based on these results it is possible to compare the MEM to other families of models (e.g., neural networks and state dependent models). It is shown that a degenerate version of the MEM is in fact equivalent to a neural network, and the number of experts in the architecture plays a similar role to the number of hidden units in the latter model.
---
Based on these results it is possible to compare the MEM to other families of models (e.g., neural_network and state dependent models). It is shown that a degenerate version of the MEM is in fact equivalent to a neural_network, and the number of experts in the architecture plays a similar role to the number of hidden_unit in the latter model.


### Stopword Removal

Combine SpaCy default, NLTK default and custom stopwords for corpus. We will use SpaCy as our NLP toolkit, so everything merged into SpaCy.

In [6]:
nlp = spacy.load("en")

In [7]:
from spacy.lang.en.stop_words import STOP_WORDS

print("#-stopwords from SpaCy:", len(STOP_WORDS))

# add NLTK stopwords
nltk_stopwords = nltk.corpus.stopwords.words("english")
custom_stopwords = []
custom_stopwords.extend(nltk_stopwords)

# add our corpus specific custom stopwords
fstops = open(CUSTOM_STOPWORDS, "r")
for stopword in fstops:
    custom_stopwords.append(stopword.strip())
fstops.close()

# add punctuation
punct_chars = [c for c in string.punctuation]
for punct_char in punct_chars:
    custom_stopwords.append(punct_char)

for stopword in custom_stopwords:
    STOP_WORDS.add(stopword)
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
    
print("#-stopwords including custom:", len(STOP_WORDS))

#-stopwords from SpaCy: 305
#-stopwords including custom: 579


### Lemmatization

In [8]:
def lemmatize_and_remove_stopwords(text, nlp, stopwords):
    doc = nlp(text)
    lemmas = []
    for token in doc:
        if token.is_stop:
            continue
        if token.like_num:
            continue
        if len(token.text) <= 5:
            continue
        if token.text.startswith("-") and token.text.endswith("-"):
            # -PRON-, etc
            continue
        lemma = token.lemma_
        lemmas.append(lemma)
    return lemmas


print(lemmatize_and_remove_stopwords(text_with_collocs, nlp, STOP_WORDS))

['compare', 'family', 'neural_network', 'dependent', 'degenerate', 'version', 'equivalent', 'neural_network', 'expert', 'architecture', 'hidden_unit']


### Set up Corpus for Topic Modeling

We apply the pipeline to all our text files, then create the dictionary and corpus objects.

In [9]:
docid2corpus, corpusid2doc = {}, {}
train_texts = []
corpus_id = 0
    
if os.path.exists(TEXTFILES_PREPROC):
    fprep = open(TEXTFILES_PREPROC, "r")
    for line in fprep:
        if corpus_id % 1000 == 0:
            print("{:d} preprocessed texts read".format(corpus_id))
        try:
            filename, text = line.strip().split("\t")
            doc_id = int(filename.split(".")[0])
            tokens = text.split(" ")
            docid2corpus[doc_id] = corpus_id
            corpusid2doc[corpus_id] = doc_id
            train_texts.append(tokens)
        except ValueError:
            pass
        corpus_id += 1
    fprep.close()
    print("{:d} preprocessed texts read, COMPLETE".format(corpus_id))
else:
    fprep = open(TEXTFILES_PREPROC, "w")
    for textfile in os.listdir(TEXTFILES_DIR):
        if corpus_id % 100 == 0:
            print("{:d} files read".format(corpus_id))
        doc_id = int(textfile.split(".")[0])
        docid2corpus[doc_id] = corpus_id
        corpusid2doc[corpus_id] = doc_id
        ftext = open(os.path.join(TEXTFILES_DIR, textfile), "r")
        lines = []
        for line in ftext:
            lines.append(line.strip())
        ftext.close()
        text = " ".join(lines)
        text = text.replace("\n", " ")
        text = re.sub("\s+", " ", text)
        text = find_and_replace(text, keywords_dict, canonical_keywords)
        text = lemmatize_and_remove_stopwords(text, nlp, STOP_WORDS)
        train_texts.append(text)
        fprep.write("{:s}\t{:s}\n".format(textfile, " ".join(text)))
        corpus_id += 1
    print("{:d} files read, COMPLETE".format(corpus_id))
    fprep.close()


0 preprocessed texts read
1000 preprocessed texts read
2000 preprocessed texts read
3000 preprocessed texts read
4000 preprocessed texts read
5000 preprocessed texts read
6000 preprocessed texts read
7000 preprocessed texts read
7238 preprocessed texts read, COMPLETE


In [10]:
dictionary = gensim.corpora.Dictionary(train_texts)
corpus = [dictionary.doc2bow(text) for text in train_texts]

print("#-documents in train_texts: {:d}".format(len(train_texts)))
print("{:d} rows, {:d} to {:d} cols each".format(
    len(corpus), 
    min([len(row) for row in corpus]),
    max([len(row) for row in corpus])))

#-documents in train_texts: 7235
7235 rows, 3 to 1645 cols each


## Build topic model

We use the HDP model because we don't know the correct number of topics for our corpus. Once we have a rough idea (either from a preliminary run of HDP or knowledge about the corpus), we can use [Topic Coherence](https://rare-technologies.com/what-is-topic-coherence/) or [Perplexity](https://en.wikipedia.org/wiki/Perplexity) to decide the best algorithm and/or the best number of topics. In our case, our intent is not to display the topics, but rather to use the topics as features for our similarity calculations, so we don't spend too much time on that and just use the results of HDP.

In [11]:
if os.path.exists(HDP_MODEL):
    # load model
    hdpmodel = gensim.models.HdpModel.load(HDP_MODEL)
else:
    # train model and save
    hdpmodel = gensim.models.HdpModel(corpus=corpus, id2word=dictionary)
    hdpmodel.save(HDP_MODEL)

hdpmodel.print_topics(num_topics=20, num_words=10)

[(0,
  '0.006*figure + 0.004*feature + 0.003*sample + 0.003*theorem + 0.003*weight + 0.003*kernel + 0.003*gaussian + 0.003*dataset + 0.002*learning + 0.002*neural'),
 (1,
  '0.006*figure + 0.004*gaussian + 0.004*kernel + 0.003*weight + 0.003*neural + 0.003*pattern + 0.003*neuron + 0.003*signal + 0.003*system + 0.003*equation'),
 (2,
  '0.005*figure + 0.003*feature + 0.003*weight + 0.003*represent + 0.003*gaussian + 0.003*theorem + 0.003*action + 0.003*neural + 0.003*equation + 0.002*neuron'),
 (3,
  '0.007*action + 0.005*figure + 0.005*system + 0.005*policy + 0.004*pattern + 0.004*control + 0.003*weight + 0.003*state + 0.003*channel + 0.003*context'),
 (4,
  '0.005*figure + 0.004*gaussian + 0.004*weight + 0.003*represent + 0.003*neural + 0.002*classifier + 0.002*control + 0.002*representation + 0.002*learn + 0.002*learning'),
 (5,
  '0.004*figure + 0.003*system + 0.003*weight + 0.003*equation + 0.003*gradient + 0.003*neural + 0.003*sequence + 0.003*gaussian + 0.002*dimension + 0.002*bo

In [12]:
topic_terms = hdpmodel.get_topics()
print(topic_terms.shape)

(150, 305037)


## Infer Topic Vectors from Topic Model

In [13]:
doc_topics = np.zeros((len(corpus), topic_terms.shape[0]))
for doc_id in range(len(corpus)):
    topic_probs = hdpmodel[corpus[doc_id]]
    for topic_id, prob in topic_probs:
        doc_topics[doc_id, topic_id] = prob
        
print(doc_topics.shape)

(7235, 150)


## Compute Cosine Similarity

You can save cycles by parallelizing the operation. So cosine similarity between vectors A and B is defined as:

$$cos(\theta) = \frac{A \cdot  B}{\left \| A \right \| \left \| B \right \|}$$

In order to compute cosine similarity for all pairs pf documents D, each document given by some vector A, we can do the following which is much faster.

$$S = \frac{D * D^T}{\left \| D \right \|^2}$$

For use from within the web tool, we save the similarity matrix to disk so we can use it later without the preceding calculation.

In [14]:
sim = np.matmul(doc_topics, np.transpose(doc_topics)) / np.linalg.norm(doc_topics)
print(sim.shape)

(7235, 7235)


In [15]:
np.save(TOPIC_SIMS, sim)
pickle.dump(docid2corpus, open(TOPIC_LOOKUP, "wb"))

## Find similar documents

In [16]:
def similar_docs(filename, sim, topn, docid2corpus, corpusid2doc):
    doc_id = int(filename.split(".")[0])
    corpus_id = docid2corpus[doc_id]
    row = sim[corpus_id, :]
    target_docs = np.argsort(-row)[0:topn].tolist()
    scores = row[target_docs].tolist()
    target_filenames = ["{:d}.txt".format(corpusid2doc[x]) for x in target_docs]
    return target_filenames, scores
    

filename2title = {}
with open(PAPERS_METADATA, "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        cols = line.strip().split("\t")
        filename2title["{:s}.txt".format(cols[0])] = cols[2]

source_filename = "1032.txt"
top_n = 10
target_filenames, scores = similar_docs(source_filename, sim, top_n, 
                                        docid2corpus, corpusid2doc)
print("Source: {:s}".format(filename2title[source_filename]))
print("--- top {:d} similar docs ---".format(top_n))
for target_filename, score in zip(target_filenames, scores):
    print("({:.5f}) {:s}".format(score, filename2title[target_filename]))

Source: Forward-backward retraining of recurrent neural networks
--- top 10 similar docs ---
(0.01294) Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks
(0.01293) Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems
(0.01292) The Effect of Eligibility Traces on Finding Optimal Memoryless Policies in Partially Observable Markov Decision Processes
(0.01292) Learning Macro-Actions in Reinforcement Learning
(0.01292) Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System
(0.01291) Hippocampal Model of Rat Spatial Abilities Using Temporal Difference Learning
(0.01291) Reinforcement Learning for Mixed Open-loop and Closed-loop Control
(0.01291) MELONET I: Neural Nets for Inventing Baroque-Style Chorale Variations
(0.01289) On-line Policy Improvement using Monte-Carlo Search
(0.01289) How to Dynamically Merge Markov Decision Processes
