# DIGI405-20S1 - Topic Modeling TED.com transcripts

This notebook introduces Gensim for topic modeling. The 2018 TED.com transcripts are also available for download on the datasets page on Learn if you wish to train models using TMT.

Work through the notebook. The key things to do are:
1. train models of different sizes (e.g. 10 topics, 30 topics, 50 topics);
2. explore the topic assignments for documents and assess the quality of the topics returned;
3. measure 'c_v' topic coherence for a number of models;
4. make notes on your observations of different models and the kinds of similarities between documents they produce.

Topic models need to be evaluated against a use case, so think about a recommendation engine: which model performs best for finding similar TED talks?

In [None]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models.wrappers import LdaMallet  # requires gensim < 4.0 (the wrappers module was removed in Gensim 4.0)
from gensim.models.coherencemodel import CoherenceModel
from gensim import similarities

import os.path
import re
import glob

import pandas as pd
import matplotlib.pyplot as plt

mallet_path = '/opt/mallet-2.0.8/bin/mallet' # this should be the correct path for the DIGI405 lab workrooms

In [None]:
# to install nltk run this in Anaconda prompt: pip install nltk 
# note if you get an error with stopwords below then uncomment the following lines and rerun this cell 
# import nltk
# nltk.download('stopwords')

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

The following cells contain functions to load a corpus from a directory of text files, preprocess the corpus and create the bag of words document-term matrix. 

In [None]:
def load_data_from_dir(path):
    file_list = glob.glob(os.path.join(path, '*.txt'))

    # create document list:
    documents_list = []
    for filename in file_list:
        # the with block closes the file automatically
        with open(filename, 'r', encoding='utf8') as f:
            documents_list.append(f.read())
    print("Total Number of Documents:", len(documents_list))
    return documents_list

In [None]:
def preprocess_data(doc_set, extra_stopwords=()):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    # replace all newlines or multiple sequences of spaces with a standard space
    doc_set = [re.sub(r'\s+', ' ', doc) for doc in doc_set]
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    # add any extra stopwords (any iterable of strings; empty by default)
    en_stop = en_stop.union(extra_stopwords)
    
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in en_stop]
        # add tokens to list
        texts.append(stopped_tokens)
    return texts

In [None]:
def prepare_corpus(doc_clean):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    # Create the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)

    # Drop terms that appear in fewer than 5 documents or in more than 50% of documents.
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

    return dictionary,doc_term_matrix

## Load and pre-process the corpus
Load the corpus, preprocess with additional stop words and output dictionary and document-term matrix.
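
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of what a document-term matrix holds and of the kind of filtering that `filter_extremes` applies. It does not use Gensim. The toy documents and the helper names `build_vocab` and `to_bow` are invented for illustration, and the thresholds are shrunk to suit three tiny documents.

```python
from collections import Counter

# Toy tokenised documents (hypothetical stand-ins for doc_clean)
toy_docs = [
    ["brain", "music", "brain", "idea"],
    ["music", "idea", "city"],
    ["idea", "city", "ocean"],
]

def build_vocab(docs, no_below=2, no_above=0.9):
    """Keep terms found in at least no_below documents and in no more
    than no_above (a fraction) of all documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = sorted(term for term, df in doc_freq.items()
                  if df >= no_below and df / len(docs) <= no_above)
    return {term: idx for idx, term in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (term_id, count) pairs for terms in the vocabulary."""
    counts = Counter(term for term in doc if term in vocab)
    return sorted((vocab[term], n) for term, n in counts.items())

vocab = build_vocab(toy_docs)
toy_matrix = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_matrix)
```

Rare terms ("brain", "ocean") and the term that appears in every document ("idea") are dropped, which is the same reasoning behind filtering the TED vocabulary: such terms do little to separate one topic from another.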

In [None]:
# adjust the path below to wherever you have the transcripts2018 folder
document_list = load_data_from_dir("data/transcripts2018/")

# I've added extra stopwords here in addition to NLTK's stopword list - you could look at adding others.
doc_clean = preprocess_data(document_list,{'laughter','applause'})

dictionary, doc_term_matrix = prepare_corpus(doc_clean)

## LDA model with 20 topics
The following cell sets the number of topics we are training the model for. The one after trains the model and outputs the topics. Note: this can take a while!

In [None]:
number_of_topics=20 # adjust this to alter the number of topics
words=20 #adjust this to alter the number of words output for the topic below

In [None]:
# runs LDA using Mallet from gensim using the number_of_topics specified above - this might take a couple of minutes
# you can create additional variables eg ldamallet20 to store models with different numbers of topics
ldamallet20 = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=number_of_topics, id2word=dictionary)

In [None]:
# output the topics
ldamallet20.show_topics(num_topics=number_of_topics,num_words=words)

## Convert to Gensim model format
Convert the Mallet model to gensim format.

In [None]:
gensimmodel20 = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet20)

## Get a coherence score

In [None]:
coherencemodel = CoherenceModel(model=gensimmodel20, texts=doc_clean, dictionary=dictionary, coherence='c_v')
print(coherencemodel.get_coherence())

## Calculate coherence scores for models with different numbers of topics

You should create some further models below to test different numbers of topics.

For example, the code block below will train a model with 30 topics and return the coherence score. You can duplicate this and change the number of topics and the variable names to keep track of different models.

In [None]:
ldamallet30 = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=30, id2word=dictionary)
gensimmodel30 = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet30)
coherencemodel = CoherenceModel(model=gensimmodel30, texts=doc_clean, dictionary=dictionary, coherence='c_v')
print(coherencemodel.get_coherence())

## Test a range of topic sizes and plot the results

**Important**: this process will take a while to run! Make sure you have tried a number of topic sizes to get a sense of which models you need to test. I suggest you test no more than 8-10 models using the code below, so you are not left waiting too long.
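
Python's `range` excludes its end value, so it is worth checking which topic sizes a setting will produce before starting a long run. A small sketch, using a hypothetical helper `topic_sizes` that treats the maximum as inclusive:

```python
def topic_sizes(min_k, max_k, interval):
    """List the topic counts to train, including max_k when the interval lands on it."""
    return list(range(min_k, max_k + 1, interval))

sizes = topic_sizes(20, 60, 10)
print(sizes)       # [20, 30, 40, 50, 60]
print(len(sizes))  # 5 models to train
```

Counting the models first makes it easy to stay within the suggested 8-10.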

In [None]:
# supply values for k and the interval, eg 20, 60, 10 will train models for 20, 30, 40, 50, and 60 topics
# the values below are example values only - replace them with your own
min_k = 20
max_k = 60
intervals = 10

coherences = {}

for i in range(min_k, max_k + 1, intervals):  # + 1 so that max_k itself is included
    ldamalletmodel = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=i, id2word=dictionary)
    gensimmodel = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamalletmodel)
    coherences[i] = CoherenceModel(model=gensimmodel, texts=doc_clean, dictionary=dictionary, coherence='c_v').get_coherence()

In [None]:
# convert the coherence scores to a pandas dataframe
df = pd.DataFrame.from_dict(coherences, orient='index', columns=['Coherence'])
df['Topics'] = df.index

In [None]:
# plot the result
df.plot(kind='line', x='Topics', y='Coherence')

## Preview a document

Preview a document - you can change the doc_id to view another document.

In [None]:
doc_id = 10 # index of document to explore
print(re.sub(r'\s+', ' ', document_list[doc_id]))

## Output the distribution of topics for the document

The next cell outputs the distribution of topics on the document specified above.
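
The model returns the distribution as a list of (topic id, proportion) pairs. A minimal sketch of the sorting step, using made-up proportions rather than real model output:

```python
# Hypothetical (topic_id, proportion) pairs; not real model output
toy_topics = [(0, 0.05), (3, 0.62), (7, 0.21), (12, 0.12)]

# Sort by proportion, largest first
ranked_topics = sorted(toy_topics, key=lambda pair: pair[1], reverse=True)

for topic_id, prop in ranked_topics:
    print("%.2f" % prop, topic_id)
```

The first pair after sorting is the document's dominant topic.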

In [None]:
document_topics = gensimmodel20.get_document_topics(doc_term_matrix[doc_id])
document_topics = sorted(document_topics, key=lambda x: x[1], reverse=True) # sorts document topics

for topic, prop in document_topics:
    topic_words = [word[0] for word in gensimmodel20.show_topic(topic, 10)]
    print("%.2f" % prop, topic, topic_words)

## Find similar documents
This will find the 5 most similar documents to the document specified above based on their topic distribution. The `MatrixSimilarity` class uses cosine similarity to measure how similar the document specified by `doc_id` is to every other document under that model. There are better measures, but this one is quick and simple to implement.
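
As a rough illustration of the measure itself, the sketch below computes cosine similarity in plain Python and ranks documents against a query. The topic distributions are invented for the example and `cosine_similarity` is a hypothetical helper, not part of Gensim.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical topic distributions over 3 topics for 4 documents
toy_distributions = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
]

query_id = 0
scores = [(doc, cosine_similarity(toy_distributions[query_id], vec))
          for doc, vec in enumerate(toy_distributions)]
ranked = sorted(scores, key=lambda item: -item[1])

# Skip the query document itself, which always scores highest
for doc, score in ranked:
    if doc != query_id:
        print(doc, round(score, 3))
```

Documents whose topic mix points in the same direction as the query score close to 1, while documents dominated by a different topic score much lower.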

In [None]:
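# index every document by its topic distribution under the 20-topic model
lda_index = similarities.MatrixSimilarity(gensimmodel20[doc_term_matrix], num_features=gensimmodel20.num_topics)

# query with the topic distribution of our doc_id from above
similarity_index = lda_index[gensimmodel20[doc_term_matrix[doc_id]]]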
# Sort the similarity index
similarity_index = sorted(enumerate(similarity_index), key=lambda item: -item[1])

for i in range(1,6): 
    document_id, similarity_score = similarity_index[i]
    print('Document Index: ',document_id)
    print('Similarity Score',similarity_score)
    print(re.sub(r'\s+', ' ', document_list[document_id][:500]), '...') # preview first 500 characters
    print()