# Topic Models

*Notebook for COMP90042, Web Search and Text Analysis*

*Copyright The University of Melbourne, 2018*

In this notebook we will use Gensim to train topic models on the Brown corpus. For the sake of efficiency, we will treat paragraphs as documents. In a real-world scenario you will probably deal with full documents instead, but the steps shown here apply to them just as well.

Let's start by reading the Brown corpus as a list of paragraphs.

In [None]:
from nltk.corpus import brown
docs = list(brown.paras())

Now let's train a topic model on this data using Gensim. There is a range of models available, but for this notebook we will stick to standard Latent Dirichlet Allocation (LDA).

Before we do that, though, we need to preprocess the data a bit so that Gensim can read it: 1) flatten each document (a list of sentences, each a list of words) into a single list of words; 2) build a dictionary mapping words to ids; and 3) generate a bag-of-words representation for each document using the word ids.

In [None]:
import gensim as gs

flat_docs = [[w for s in d for w in s] for d in docs]
brown_dict = gs.corpora.dictionary.Dictionary(flat_docs)
bow_docs = [brown_dict.doc2bow(d) for d in flat_docs]

Now we are ready to train a topic model. While we could use EM for this, the standard approach is to estimate the full posterior distribution over the parameters. This is a complex procedure that is beyond the scope of this module, but luckily Gensim implements an approximate version of it (online variational Bayes), so we can treat it as a black box.

Notice we give a few parameters to the model:

- The number of topics.
- The dictionary.
- The number of passes over the training data. This relates to the training algorithm; put simply, more passes usually give better results but take longer.
- The random state, since training involves random initialisation. Fixing it makes the results reproducible.

In [None]:
import numpy as np
ldamodel = gs.models.ldamodel.LdaModel(bow_docs, num_topics=10, id2word=brown_dict, 
                                       passes=20, random_state=np.random.RandomState(10))
print(ldamodel)
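To make the black box a little less opaque, it can help to look at the generative story that LDA assumes. The sketch below is not how Gensim trains a model; it only samples one toy document from hand-picked settings. The vocabulary, hyperparameter values and names such as `generate_doc` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["game", "team", "score", "vote", "party", "law"]  # toy vocabulary
num_topics, doc_len = 2, 8
alpha, eta = 0.5, 0.5   # symmetric Dirichlet hyperparameters (arbitrary choices)

# One word distribution per topic: shape (num_topics, len(vocab)).
topic_word = rng.dirichlet([eta] * len(vocab), size=num_topics)

def generate_doc():
    # Each document gets its own mixture over topics.
    doc_topic = rng.dirichlet([alpha] * num_topics)
    words = []
    for _ in range(doc_len):
        z = rng.choice(num_topics, p=doc_topic)       # pick a topic for this position
        w = rng.choice(len(vocab), p=topic_word[z])   # pick a word from that topic
        words.append(vocab[w])
    return doc_topic, words

doc_topic, words = generate_doc()
print(doc_topic)
print(words)
```

Training runs this story in reverse: given only the words, it infers plausible topic-word and document-topic distributions.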

Let's now inspect the learned topics. To do this, we will print word lists for each topic and manually inspect if we can infer any meaning from these lists.

In [None]:
topics = ldamodel.print_topics(num_words=20)
print(topics)

Notice that each topic comes with a list of number/word pairs. Each number is the probability of the word given the topic (check this). However, this output is hard to read, so let's print it in a friendlier format.

In [None]:
def pprint_topics(ldamodel, num_words=20):
    # formatted=False returns (topic_id, [(word, probability), ...]) pairs,
    # so there is no need to parse the strings produced by print_topics.
    topics = ldamodel.show_topics(num_topics=-1, num_words=num_words, formatted=False)
    for t_id, word_probs in topics:
        w_list = ' '.join(w for w, _ in word_probs)
        print('%d:\t%s' % (t_id, w_list))

pprint_topics(ldamodel, num_words=20)

Hard to understand what's happening, right? This is because we did not do any preprocessing on the corpus. As in text classification, we should preprocess the corpus before training a topic model. Here, though, the issue is much more evident: the topics are dominated by punctuation and function words, which makes them largely uninterpretable.

So let's apply the following preprocessing steps. These might take a few seconds to run.

- lowercase words
- ignore punctuation (and any other non-alphabetic tokens)
- remove stopwords
- lemmatise words

In [None]:
import nltk

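lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
# Use the English list only (words() with no argument returns every language),
# stored as a set for fast membership tests.
stopwords = set(nltk.corpus.stopwords.words('english'))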

def preprocess_docs(corpus, stopwords):
    new_corpus = []
    for doc in corpus:
        new_doc = []
        for word in doc:
            if not word.isalpha():
                continue
            new_word = word.lower()
            if new_word in stopwords:
                continue
            new_word = lemmatize(new_word)
            new_doc.append(new_word)
        new_corpus.append(new_doc)
    return new_corpus


def lemmatize(word):
    # Try the word as a verb first; if that changes nothing, fall back to noun.
    lemma = lemmatizer.lemmatize(word, 'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word, 'n')
    return lemma
  

filtered_docs = preprocess_docs(flat_docs, stopwords)
brown_dict = gs.corpora.dictionary.Dictionary(filtered_docs)
bow_docs = [brown_dict.doc2bow(d) for d in filtered_docs]
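To see the effect of the filtering steps on a single sentence, here is a simplified sketch. It uses a tiny hand-written stop list and leaves out lemmatisation so that it runs without any NLTK data; `toy_stopwords` and `filter_tokens` are hypothetical names, not part of the notebook's pipeline.

```python
# Hypothetical mini stop list; the notebook uses NLTK's list instead.
toy_stopwords = {"the", "a", "on", "and", "was"}

def filter_tokens(tokens, stopwords):
    kept = []
    for token in tokens:
        if not token.isalpha():      # drops punctuation and numbers
            continue
        token = token.lower()
        if token in stopwords:       # drops very frequent function words
            continue
        kept.append(token)
    return kept

sentence = ["The", "cat", "sat", "on", "the", "mat", ",", "and", "purred", "."]
print(filter_tokens(sentence, toy_stopwords))
# prints ['cat', 'sat', 'mat', 'purred']
```

Note that the `isalpha()` test is quite aggressive: it also discards tokens such as numbers and hyphenated words, which may or may not be what you want for a given corpus.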

Now let's train a new topic model on the filtered data and check what we come up with.

In [None]:
ldamodel = gs.models.ldamodel.LdaModel(bow_docs, num_topics=10, id2word=brown_dict, 
                                       passes=20, random_state=np.random.RandomState(10))
pprint_topics(ldamodel, num_words=30)

Much better now, right? Not every topic is fully interpretable, but some insights can be gained. Notice that some words appear in more than one topic: this is expected in LDA (revisit the reading material and slides if you do not understand why). Can you find appropriate labels for some of the topics?

From here there are plenty of things you can experiment with. Check the Gensim website (https://radimrehurek.com/gensim/) for documentation and tutorials. Here are a few suggestions:

- For visualisation, you can increase the number of words and/or try to come up with some filtering such as printing nouns and verbs only (how would you do that?).
- Inspect some documents in the corpus and check their topic distribution. You should check out methods in the Gensim API for that.
- Change the number of topics in LDA.
- Train on a different corpus, such as the Twitter samples.