## Topic Modeling

We will build a topic model using Latent Dirichlet Allocation (LDA). First, we prepare the data by tokenizing and preprocessing it. Then we build the model and display its output.

In [19]:
import pandas as pd

import nltk
import pickle
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer

from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

Here we prepare our stopword list for subsequent filtering. We use nltk's stopword list for English and add our corpus-specific terms to it. Add your own words to the list as suggested by the comment.

In [20]:
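nltk.download('stopwords', quiet=True)
# sent_tokenize (used below) needs the Punkt sentence models;
# recent NLTK releases name this resource 'punkt_tab' instead
nltk.download('punkt', quiet=True)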
en_stop = set(nltk.corpus.stopwords.words('english'))
# add additional stopwords that are relevant for the corpus
en_stop.update(['tourism', 'tourist', 'innovation', 'research', 'study',
                'paper'])

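Next we define the tokenization function. It uses a regex tokenizer that keeps word characters only, so punctuation and special signs such as # are discarded, and words like _let's_ are split into _let_ and _s_. The tokenizer itself keeps digits. Tokens are then lowercased, stopwords are removed, and only tokens longer than four characters are kept, which also drops short fragments and numbers of up to four digits.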

In [21]:
def tokenize(text):
    # split into lowercased word tokens, e.g. "The cat chased its tail."
    # becomes ["the", "cat", "chased", "its", "tail"] before filtering
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = [w.lower() for s in sent_tokenize(text) for w in tokenizer.tokenize(s)]
    # drop stopwords and tokens of four characters or fewer (only "chased" survives)
    return [token for token in tokens if token not in en_stop and len(token) > 4]
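
To make the effect of each filtering step concrete, here is a minimal standard-library sketch. It is not the notebook's function: `re.findall` stands in for `RegexpTokenizer(r'\w+')`, the sentence-splitting step is skipped, and `demo_stop` is a tiny made-up stopword set used only for illustration.

```python
import re

# Hypothetical mini stopword list, for illustration only
demo_stop = {'in', 'the', 'tourism'}

def tokenize_sketch(text, stop=demo_stop, min_len=5):
    # \w+ keeps runs of letters, digits and underscores; anything else splits tokens
    tokens = re.findall(r'\w+', text.lower())
    # keep tokens that are not stopwords and have at least min_len characters
    return [t for t in tokens if t not in stop and len(t) >= min_len]

sample = "Let's visit #Paris in 2024: sustainable tourism matters."
raw_tokens = re.findall(r'\w+', sample.lower())
print(raw_tokens)
print(tokenize_sketch(sample))
```

The raw token list still contains _let_, _s_ and _2024_. The length filter removes them, which is why short numbers never reach the model.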

Read the file that we wish to use as the corpus. Replace the file with your own if you wish to use another data set instead of Innovation.csv.

col defines the name of the column that will be used for the analysis. Here we use 'Title' for speed, but you can also use 'Content' or 'Abstract'. In fact, you can use any column in the data (the column names are printed below the cell), as long as it contains text.

In [31]:
df = pd.read_csv('Innovation/Innovation.csv')
print(list(df))
# set the name of the column to model
col = 'Title'

['Title', 'Authors', 'PublicationName', 'Type', 'Abstract', 'Content', 'Volume', 'Issue', 'Date', 'Pages', 'PII', 'Keywords', 'URL', 'OpenAccess', 'References', 'CitedBy', 'AuthorAUID', 'AuthKeywords', 'SubjectAreas']


In [23]:
text_data = []
for index, row in df.iterrows():
    if pd.isna(row[col]):
        text_data.append([])
    else:
        tokens = tokenize(row[col])
        text_data.append(tokens)

In [24]:
dictionary = Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
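
The two lines above turn each token list into a bag of words: the Dictionary maps every distinct token to an integer id, and doc2bow replaces a document with (token id, count) pairs. A rough standard-library sketch of that idea follows. `build_vocab` and `to_bow` are hypothetical helpers written for this illustration, and the ids gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_vocab(docs):
    # assign ids in order of first appearance (an assumption of this sketch)
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    # count each known token and return sorted (token_id, count) pairs
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [['climate', 'change', 'climate'], ['change', 'education'], []]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

An empty document, such as a row with a missing title, simply becomes an empty list of pairs.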

In [25]:
# use a context manager so the file is closed after writing
with open('Innovation/innov_corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('Innovation/innov_dictionary.gensim')

Define the number of topics you wish to retrieve with LDA. We suggest between 3 and 10 topics, as larger numbers are difficult to interpret.

In [26]:
NUM_TOPICS = 5
ldamodel = LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('Innovation/model5.gensim')

Inspect the results of topic modeling.

In [35]:
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print("Topic {}. Relevant terms: {}".format(int(topic[0]) + 1, topic[1]))

Topic 1. Relevant terms: 0.017*"sustainable" + 0.015*"development" + 0.012*"management" + 0.011*"based"
Topic 2. Relevant terms: 0.013*"development" + 0.008*"small" + 0.007*"chapter" + 0.006*"local"
Topic 3. Relevant terms: 0.012*"chapter" + 0.011*"future" + 0.011*"economic" + 0.010*"review"
Topic 4. Relevant terms: 0.016*"development" + 0.014*"change" + 0.010*"climate" + 0.010*"education"
Topic 5. Relevant terms: 0.035*"chapter" + 0.020*"management" + 0.015*"performance" + 0.014*"marketing"
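
Each topic is printed as a weighted sum of terms, where the number in front of a term is its weight within that topic. If you want the terms as data rather than as a string, one option is to parse the printed line. The sketch below does this with a regular expression. `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

def parse_topic(topic_string):
    # match fragments such as 0.017*"sustainable" and return (term, weight) pairs
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]

line = '0.017*"sustainable" + 0.015*"development" + 0.012*"management" + 0.011*"based"'
terms = parse_topic(line)
print(terms)
```

When the model object is at hand, gensim's show_topic method returns similar (term, weight) pairs directly, which avoids parsing strings.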
