# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [7]:
import os
from collections import defaultdict

import gensim
import nltk

from gensim import models

For tokenization and stopword removal, download the NLTK punkt tokenizer and stopword list.

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/timon/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /Users/timon/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf), which is included with gensim and contains 300 news articles from the Australian Broadcasting Corporation.

In [4]:
from gensim.test.utils import datapath
train_file = datapath('lee_background.cor')
# read one article per line; the with-block closes the file afterwards
with open(train_file) as f:
    articles_orig = f.read().splitlines()

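Preprocess the text by tokenizing, lowercasing, and removing stopwords (stemming is optional). Then build the gensim dictionary, dropping words that are very rare or very frequent, and convert each article to a bag-of-words vector.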

In [6]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'\"', '\'', '\'\'', '`', '``', '\'s'}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    # article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in fewer than 2 documents or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # access one item so gensim builds its id-to-token mapping
corpus_bow = [dictionary.doc2bow(article) for article in articles]
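
To make the two steps above concrete, here is a minimal pure-Python sketch of document-frequency filtering and bag-of-words conversion. The toy documents are made up, and the token ids are only illustrative: gensim assigns its own ids, so this approximates what `filter_extremes` and `doc2bow` do rather than reproducing them exactly.

```python
from collections import Counter

# Toy tokenized documents (hypothetical data, not taken from the Lee corpus)
docs = [
    ["fire", "police", "sydney", "fire"],
    ["police", "court", "sydney"],
    ["cricket", "test", "sydney"],
    ["fire", "weather", "sydney"],
]

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(token for doc in docs for token in set(doc))

# Keep tokens that occur in at least `no_below` documents and in at most
# a fraction `no_above` of all documents
no_below, no_above = 2, 0.5
kept = sorted(t for t, df in doc_freq.items()
              if no_below <= df <= no_above * len(docs))
token2id = {token: i for i, token in enumerate(kept)}

def to_bow(doc):
    """Return sorted (token_id, count) pairs; out-of-vocabulary tokens are dropped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

In this toy corpus, "sydney" occurs in every document and the tokens that occur only once are too rare, so only "fire" and "police" survive the filter. The third document ends up with an empty vector.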


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [10]:
model_tfidf = models.TfidfModel(corpus_bow)

corpus_tfidf = model_tfidf[corpus_bow]
print(corpus_bow[0])
print(corpus_tfidf[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 7), (42, 1), (43, 1), (44, 1), (45, 3), (46, 1), (47, 1), (48, 2), (49, 2), (50, 3), (51, 3), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 8), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 3), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 5), (90, 1), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 3), (99, 1), (100, 1), (101, 3), (102, 1), (103, 1), (104, 1), (105, 4), (106, 2), (107, 1), (108, 1), (109, 1), (110, 1)]
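
As a rough illustration of what the TF-IDF transformation does, the following sketch recomputes the weights by hand on a toy corpus. It assumes the weighting that, to our understanding, `TfidfModel` uses by default: raw term frequency times log2(N / df), followed by scaling each vector to unit length. The data and the helper name `tfidf` are made up for this example.

```python
import math

# Toy bag-of-words corpus as (token_id, count) pairs (hypothetical data)
toy_bow = [
    [(0, 2), (1, 1)],
    [(1, 1), (2, 1)],
    [(1, 3)],
    [(0, 1), (2, 2)],
]
num_docs = len(toy_bow)

# Document frequency of each token id
doc_freq = {}
for doc in toy_bow:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(doc):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weighted = [(tid, count * math.log2(num_docs / doc_freq[tid]))
                for tid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    # drop zero weights, e.g. for tokens that occur in every document
    return [(tid, w / norm) for tid, w in weighted if norm > 0 and w > 0]

vectors = [tfidf(doc) for doc in toy_bow]
for vec in vectors:
    print(vec)
```

Token 1 occurs in three of the four documents, so it receives a lower weight than the rarer tokens 0 and 2, even where its raw count is the same.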

Now we train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 10 topics on the bag-of-words corpus. LDA models word counts, so it is given the raw counts rather than the TF-IDF weights. Save it to a variable `model_lda`.

In [12]:
model_lda = models.LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=10,
)

Let's inspect 5 of the topics of our model.

In [13]:
model_lda.print_topics(5)

[(3,
  '0.012*"australia" + 0.009*"mr" + 0.007*"australian" + 0.005*"two" + 0.005*"year" + 0.005*"world" + 0.005*"india" + 0.005*"also" + 0.005*"police" + 0.004*"last"'),
 (7,
  '0.015*"mr" + 0.007*"us" + 0.005*"afghanistan" + 0.005*"people" + 0.005*"bin" + 0.005*"states" + 0.005*"united" + 0.004*"two" + 0.004*"australian" + 0.004*"laden"'),
 (9,
  '0.008*"government" + 0.007*"new" + 0.007*"mr" + 0.007*"us" + 0.006*"afghanistan" + 0.006*"people" + 0.005*"south" + 0.005*"qantas" + 0.004*"australia" + 0.004*"two"'),
 (5,
  '0.007*"mr" + 0.007*"palestinian" + 0.006*"new" + 0.006*"people" + 0.005*"australian" + 0.005*"would" + 0.004*"fire" + 0.004*"police" + 0.004*"government" + 0.004*"last"'),
 (2,
  '0.011*"mr" + 0.007*"palestinian" + 0.005*"arafat" + 0.005*"israeli" + 0.005*"australia" + 0.004*"minister" + 0.004*"test" + 0.004*"new" + 0.004*"security" + 0.004*"first"')]

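We see 5 of the 10 topics. LDA topics have no natural ordering, so the subset shown is arbitrary and can change between training runs. For each topic, the 10 highest-probability words are shown, each with its weight, which is the probability of that word under the topic.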

## Document Similarity
We now use our LDA model to compare the similarity of new documents (*queries*) to documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [14]:
from gensim import similarities

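# index the documents in LDA space, using the same bag-of-words input the model was trained on
index = similarities.MatrixSimilarity(model_lda[corpus_bow], num_features=model_lda.num_topics)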

Now, write a function that takes a query string as input and returns the LDA representation for it. Make sure to apply the same preprocessing as we did to the documents.

In [None]:
def query_to_lda(query):
    # apply the same preprocessing as for the corpus documents
    vec_bow = dictionary.doc2bow(preprocess(query))
    return model_lda[vec_bow]  # convert the query to LDA space

doc = "Human computer interaction"
vec_lda = query_to_lda(doc)
print(vec_lda)



Print the top 5 most similar documents, together with their similarities, using your index created above.
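
The ranking step can be sketched without gensim. `MatrixSimilarity` compares vectors by cosine similarity, so the example below computes cosine similarity between a query and a few dense topic vectors and returns the best matches. The topic distributions and the helper names `cosine` and `top_k` are made up for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query_vec, doc_vecs, k=5):
    """Return the k (doc_index, similarity) pairs with the highest similarity."""
    sims = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical topic distributions over 3 topics (made-up numbers)
doc_topics = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
query_topics = [0.9, 0.05, 0.05]

for doc_id, sim in top_k(query_topics, doc_topics, k=2):
    print(doc_id, round(sim, 3))
```

With the gensim objects from this notebook, passing the query's LDA vector to `index` yields one similarity score per indexed article. Sorting those scores in descending order and keeping the first five gives the requested result.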

Run your code again, now training an LDA model with 100 topics. Do you see a qualitative difference in the top-5 most similar documents?