# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [3]:
import os
from collections import defaultdict

import gensim
import nltk

from gensim import models

For tokenizing words and stopword removal, download the NLTK punkt tokenizer and stopwords list.

In [4]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/timon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/timon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf), which is included with gensim and contains 300 news articles from the Australian Broadcasting Corporation.

In [5]:
from gensim.test.utils import datapath
train_file = datapath('lee_background.cor')
with open(train_file) as f:  # one article per line; the context manager closes the file
    articles_orig = f.read().splitlines()

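Preprocess the text by tokenizing, lowercasing, and removing stopwords (stemming is optional and commented out below). Then build the dictionary and drop words that are very rare or very common.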

In [6]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'\"', '\'', '\'\'', '`', '``', '\'s'}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    # article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in fewer than 2 documents, or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # load the dictionary by calling it once
corpus_bow = [dictionary.doc2bow(article) for article in articles]
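
The document-frequency filter and the bag-of-words conversion above can be illustrated without gensim. The following is a minimal sketch in plain Python; the toy documents and the helper names `filter_vocab` and `to_bow` are hypothetical, and it only approximates `filter_extremes` and `doc2bow` (gensim additionally caps the vocabulary size via `keep_n` and assigns ids in its own way).

```python
from collections import Counter

# Toy tokenized documents (hypothetical data, not taken from the Lee corpus).
docs = [
    ["fire", "crews", "battle", "blaze"],
    ["fire", "threatens", "homes"],
    ["government", "announces", "fire", "relief"],
    ["government", "budget", "relief"],
]

def filter_vocab(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    the fraction `no_above` of all documents; return a token -> id mapping."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs,
    silently dropping tokens that are not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = filter_vocab(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

In this toy corpus, "fire" appears in 3 of 4 documents and is removed as too common, while words that appear in only one document are removed as too rare.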


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [7]:
model_tfidf = models.TfidfModel(corpus_bow)

corpus_tfidf = model_tfidf[corpus_bow]
print(corpus_bow[0])
print(corpus_tfidf[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 7), (42, 1), (43, 1), (44, 1), (45, 3), (46, 1), (47, 1), (48, 2), (49, 2), (50, 3), (51, 3), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 8), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 3), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 5), (90, 1), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 3), (99, 1), (100, 1), (101, 3), (102, 1), (103, 1), (104, 1), (105, 4), (106, 2), (107, 1), (108, 1), (109, 1), (110, 1)]
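
To make the transformation concrete, here is a rough sketch of the weighting in plain Python. It assumes the scheme that, to my understanding, `TfidfModel` uses by default: raw term count times `log2(N / document_frequency)`, followed by scaling each document vector to unit length. The toy corpus and the function name `tfidf_transform` are hypothetical.

```python
import math

# Toy bag-of-words corpus as (token_id, count) pairs (hypothetical data).
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 1)],
    [(2, 3)],
    [(1, 1), (2, 1)],
]

def tfidf_transform(corpus):
    """Weight each count by log2(N / df), then scale every document to unit L2 norm."""
    n_docs = len(corpus)

    # Document frequency: in how many documents does each token id occur?
    doc_freq = {}
    for doc in corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    result = []
    for doc in corpus:
        weighted = [
            (token_id, count * math.log2(n_docs / doc_freq[token_id]))
            for token_id, count in doc
        ]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        if norm == 0:
            # Every token occurs in all documents, so nothing is informative.
            result.append([])
        else:
            result.append([(token_id, w / norm) for token_id, w in weighted if w > 0])
    return result

toy_tfidf = tfidf_transform(toy_corpus)
for doc in toy_tfidf:
    print([(token_id, round(weight, 3)) for token_id, weight in doc])
```

Tokens that occur in many documents receive a small weight, and a token that occurs in every document would receive weight zero.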

Now we train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 50 topics on the bag-of-words corpus `corpus_bow`. Save it to a variable `model_lda`.

In [27]:
model_lda = models.LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=50,
    passes=20,
    iterations=400
)

Let's inspect 5 of the topics of our model.

In [28]:
model_lda.print_topics(5)

[(19,
  '0.015*"palestinian" + 0.014*"government" + 0.012*"two" + 0.011*"us" + 0.011*"israeli" + 0.010*"killed" + 0.009*"fire" + 0.008*"forces" + 0.007*"northern" + 0.007*"security"'),
 (11,
  '0.031*"government" + 0.024*"hill" + 0.024*"afghanistan" + 0.021*"defence" + 0.017*"interim" + 0.016*"man" + 0.015*"senator" + 0.015*"minister" + 0.014*"australian" + 0.013*"force"'),
 (42,
  '0.018*"new" + 0.014*"space" + 0.012*"government" + 0.010*"road" + 0.010*"highway" + 0.010*"police" + 0.010*"people" + 0.010*"ses" + 0.008*"station" + 0.008*"launch"'),
 (21,
  '0.025*"pakistan" + 0.018*"indian" + 0.015*"india" + 0.013*"attack" + 0.010*"mr" + 0.010*"new" + 0.009*"best" + 0.009*"two" + 0.008*"would" + 0.008*"pakistani"'),
 (46,
  '0.027*"role" + 0.027*"minister" + 0.020*"yes" + 0.013*"sir" + 0.013*"heart" + 0.013*"civil" + 0.013*"attack" + 0.013*"died" + 0.013*"friend" + 0.013*"prime"')]

We see 5 of the 50 topics. LDA topics have no natural ordering, so the subset shown is arbitrary and may change between training runs. For each topic, the 10 most probable words are shown, each with its probability under that topic.

## Document Similarity
We now use our LDA model to measure how similar new documents (*queries*) are to the documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [29]:
from gensim import similarities

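# Index the LDA topic vectors of all articles. The articles are passed in their
# TF-IDF representation, so queries must be TF-IDF-weighted as well.
# num_features is set explicitly so the index always has one column per topic.
index = similarities.MatrixSimilarity(
    model_lda[corpus_tfidf], num_features=model_lda.num_topics
)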
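
`MatrixSimilarity` ranks documents by cosine similarity between the query vector and each indexed vector. The sketch below reproduces that computation in plain Python on hypothetical sparse topic vectors; the helper names `to_dense` and `cosine` and all the numbers are illustrative assumptions, not output of the model above.

```python
import math

def to_dense(sparse_vec, num_topics):
    """Expand (topic_id, weight) pairs into a dense list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_vec:
        dense[topic_id] = weight
    return dense

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sparse topic distributions for three documents and one query.
num_topics = 3
doc_topics = [
    [(0, 0.9), (2, 0.1)],
    [(1, 1.0)],
    [(0, 0.5), (1, 0.5)],
]
query_topics = [(0, 1.0)]

query_dense = to_dense(query_topics, num_topics)
toy_sims = [cosine(query_dense, to_dense(doc, num_topics)) for doc in doc_topics]

# Rank documents from most to least similar, as (document_number, similarity).
ranked = sorted(enumerate(toy_sims), key=lambda item: item[1], reverse=True)
print(ranked)
```

A document that shares no topic with the query gets a similarity of exactly 0.0, and documents with identical topic vectors get identical scores. With sparse LDA vectors this is likely to happen often.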

Now, write a function that takes a query string as input and returns the LDA representation for it. Make sure to apply the same preprocessing as we did to the documents.

In [32]:
def get_lda_vector(text):
    pre_query = preprocess(text)

    vec_bow = dictionary.doc2bow(pre_query)
    tfidf = model_tfidf[vec_bow]
    return model_lda[tfidf]

Print the top 5 most similar documents, together with their similarities, using your index created above.

In [33]:
vec_lda = get_lda_vector("A new bill sparked massive protests in Israel, as it would massively limit the powers of the judiciary")

sims = index[vec_lda]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

sims = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)
for doc_position, doc_score in sims[:5]:
    print(doc_score, articles_orig[doc_position])

[(0, 0.22882265), (1, 0.0), (2, 0.40528202), (3, 0.0), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.50820297), (9, 0.0), (10, 0.0), (11, 0.0), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.0), (17, 0.0), (18, 0.0), (19, 0.0), (20, 0.76394856), (21, 0.4007116), (22, 0.0), (23, 0.0), (24, 0.0), (25, 0.0), (26, 0.0), (27, 0.0), (28, 0.0), (29, 0.0), (30, 0.0), (31, 0.0), (32, 0.0), (33, 0.0), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.76394856), (39, 0.0), (40, 0.0), (41, 0.0), (42, 0.0), (43, 0.50820297), (44, 0.0), (45, 0.0), (46, 0.0), (47, 0.0), (48, 0.0), (49, 0.0), (50, 0.0), (51, 0.0), (52, 0.0), (53, 0.0), (54, 0.0), (55, 0.0), (56, 0.0), (57, 0.0), (58, 0.0), (59, 0.0), (60, 0.2527294), (61, 0.0), (62, 0.0), (63, 0.0), (64, 0.0), (65, 0.0), (66, 0.0), (67, 0.0), (68, 0.0), (69, 0.0), (70, 0.0), (71, 0.0), (72, 0.0), (73, 0.0), (74, 0.0), (75, 0.0), (76, 0.0), (77, 0.0), (78, 0.76394856), (79, 0.0), (80, 0.0), (81, 0.0), (82, 0.0), (83, 0.0), (84, 0.0), (85, 0.312107

Run your code again, now training an LDA model with 100 topics. Do you see a qualitative difference in the top-5 most similar documents?