# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [2]:
import os
from collections import defaultdict

import gensim
import nltk

from gensim import models

For tokenizing words and stopword removal, download the NLTK punkt tokenizer and stopwords list.

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/timon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/timon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included with gensim that contains 300 news articles of the Australian Broadcasting Corporation.

In [4]:
from gensim.test.utils import datapath
train_file = datapath('lee_background.cor')
articles_orig = open(train_file).read().splitlines()

Preprocess the text by lowercasing, removing stopwords, stemming, and removing rare words.

In [5]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'\"', '\'', '\'\'', '`', '``', '\'s'}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    # article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in less than 2 documents, or more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # load the dictionary by calling it once
corpus_bow = [dictionary.doc2bow(article) for article in articles]


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [6]:
model_tfidf = models.TfidfModel(corpus_bow)

corpus_tfidf = model_tfidf[corpus_bow]
print(corpus_bow[0])
print(corpus_tfidf[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 7), (42, 1), (43, 1), (44, 1), (45, 3), (46, 1), (47, 1), (48, 2), (49, 2), (50, 3), (51, 3), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 8), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 3), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 5), (90, 1), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 3), (99, 1), (100, 1), (101, 3), (102, 1), (103, 1), (104, 1), (105, 4), (106, 2), (107, 1), (108, 1), (109, 1), (110, 1)]

Now we train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 10 topics on the TF-IDF corpus. Save it to a variable `model_lda`.

In [7]:
model_lda = models.LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=10,
)

Let's inspect the first 5 topics of our model.

In [8]:
model_lda.print_topics(5)

[(9,
  '0.008*"palestinian" + 0.007*"mr" + 0.006*"south" + 0.005*"us" + 0.005*"bin" + 0.005*"new" + 0.005*"laden" + 0.004*"taliban" + 0.004*"one" + 0.004*"also"'),
 (8,
  '0.007*"mr" + 0.005*"last" + 0.005*"us" + 0.005*"people" + 0.005*"one" + 0.004*"south" + 0.004*"australia" + 0.004*"two" + 0.004*"new" + 0.004*"would"'),
 (1,
  '0.009*"mr" + 0.007*"new" + 0.007*"australia" + 0.007*"people" + 0.006*"australian" + 0.005*"government" + 0.004*"would" + 0.004*"say" + 0.004*"us" + 0.004*"per"'),
 (0,
  '0.009*"mr" + 0.007*"people" + 0.006*"australian" + 0.005*"palestinian" + 0.005*"australia" + 0.005*"new" + 0.005*"us" + 0.005*"also" + 0.005*"two" + 0.005*"police"'),
 (5,
  '0.011*"mr" + 0.008*"government" + 0.008*"australian" + 0.006*"afghanistan" + 0.005*"south" + 0.005*"first" + 0.005*"new" + 0.005*"australia" + 0.005*"security" + 0.005*"palestinian"')]

We see the 5 topics with the highest importance. For each topic, the 10 most important words are shown, together with their coefficient of "alignment" to the topic.

## Document Similarity
We now use our LDA model to compare the similarity of new documents (*queries*) to documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [9]:
from gensim import similarities

index = similarities.MatrixSimilarity(model_lda[corpus_tfidf])

Now, write a function that takes a query string as input and returns the LDA representation for it. Make sure to apply the same preprocessing as we did to the documents.

In [10]:
query = "Human computer interaction"
pre_query = preprocess(query)

vec_bow = dictionary.doc2bow(pre_query)
vec_lsi = model_lda[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, 0.05000383), (1, 0.050005518), (2, 0.050006784), (3, 0.050003476), (4, 0.050005294), (5, 0.050003637), (6, 0.050004765), (7, 0.05000939), (8, 0.54995304), (9, 0.05000425)]


Print the top 5 most similar documents, together with their similarities, using your index created above.

In [11]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims[:5]:
    print(doc_score, articles[doc_position])

[(0, 0.12362331), (1, 0.10999815), (2, 0.11755372), (3, 0.10955433), (4, 0.11260693), (5, 0.108977884), (6, 0.68706435), (7, 0.11419714), (8, 0.14642076), (9, 0.12079371), (10, 0.11103848), (11, 0.112090275), (12, 0.10658221), (13, 0.15727511), (14, 0.114781186), (15, 0.11308652), (16, 0.1168414), (17, 0.15305422), (18, 0.1179017), (19, 0.11460081), (20, 0.11931464), (21, 0.979245), (22, 0.1176006), (23, 0.11233145), (24, 0.115853705), (25, 0.14794421), (26, 0.97615093), (27, 0.11531561), (28, 0.97474885), (29, 0.110181294), (30, 0.115607515), (31, 0.11005247), (32, 0.12687908), (33, 0.0877205), (34, 0.11033183), (35, 0.11082268), (36, 0.12687851), (37, 0.0877205), (38, 0.97555023), (39, 0.110385455), (40, 0.0877196), (41, 0.11470556), (42, 0.110219136), (43, 0.975523), (44, 0.10901283), (45, 0.11098352), (46, 0.14400078), (47, 0.11189617), (48, 0.10598667), (49, 0.13254867), (50, 0.112762526), (51, 0.11673851), (52, 0.10686405), (53, 0.11523597), (54, 0.97629166), (55, 0.10875548), (5

Run your code again, now training an LDA model with 100 topics. Do you see a qualitative difference in the top-5 most similar documents?