# Topic Modelling

We now turn to topic modelling, using the Latent Dirichlet Allocation (LDA) method.

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords

In [None]:
# import the pre-processed csv file 
df = pd.read_csv('data/data_for_tm.csv')

### Some filtering (if needed)

Each topic model is evaluated for a specific time frame, so we filter by decade before fitting.

We made three other separate files that filter by year for topic modelling, since it usually takes a bit longer to run than the collocates analysis and separate files let us run the models concurrently.

The filters for The New York Times and the Pittsburgh Post-Gazette are also below.
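
The decade and publisher filters can also be wrapped in small helpers, so that each copy of the notebook only changes one call. This is a minimal sketch, assuming the `year` and `publisher` columns used in the filter cells; `filter_by_decade` and `filter_by_publisher` are hypothetical helper names, and the sample rows are made up for illustration.

```python
import pandas as pd


def filter_by_decade(frame, start_year):
    """Keep rows whose year falls in [start_year, start_year + 9]."""
    in_decade = frame['year'].between(start_year, start_year + 9)
    return frame.loc[in_decade]


def filter_by_publisher(frame, publisher):
    """Keep rows from a single publisher (exact string match)."""
    return frame.loc[frame['publisher'] == publisher]


# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    'year': [1985, 1993, 2004, 1999],
    'publisher': ['The New York Times',
                  'Pittsburgh Post-Gazette (Pennsylvania)',
                  'The New York Times',
                  'The New York Times'],
})

nineties = filter_by_decade(sample, 1990)
nyt_nineties = filter_by_publisher(nineties, 'The New York Times')
print(nyt_nineties['year'].tolist())
```

Note that this helper applies both a lower and an upper bound, whereas a filter such as `df['year'] < 1990` would also keep any rows dated before 1980.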

In [None]:
# filter by decade 
# df = df.loc[df['year'] < 1990] # for 1980-1989
# df = df.loc[(df['year'] > 1989) & (df['year'] < 2000)] # for 1990-1999
# df = df.loc[(df['year'] > 1999) & (df['year'] < 2010)] # for 2000-2009

In [None]:
# filter by publisher (specifically NYT and Pittsburgh Post-Gazette)
# df = df.loc[df['publisher'] == 'The New York Times'] # 17833 articles 
# df = df.loc[df['publisher'] == 'Pittsburgh Post-Gazette (Pennsylvania)'] # 3229 articles

In [None]:
from gensim.utils import simple_preprocess

# convert the cleaned text column into a list of documents
data = df['clean_text'].values.tolist()

# tokenize each document into a list of lowercase words
data_words = [simple_preprocess(str(doc)) for doc in data]

# preview the first 30 tokens of the first document
print(data_words[0][:30])

In [None]:
from gensim import corpora

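# Create the dictionary (maps each unique token to an integer id)
id2word = corpora.Dictionary(data_words)

print(id2word)

# Create the corpus: each document becomes a bag of words,
# i.e. a list of (token_id, count) pairs
texts = data_words
corpus = [id2word.doc2bow(text) for text in texts]

# View the first 30 (token_id, count) pairs of the first document
print(corpus[0][:30])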
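
To make the bag-of-words format concrete, here is a pure-Python sketch of what a list of `(token_id, count)` pairs looks like. It is an illustration only, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the toy documents are made up, and the ids that gensim's `Dictionary` assigns may differ.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a tokenized document to a sorted list of (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Made-up tokenized documents, for illustration only
toy_docs = [['city', 'council', 'vote', 'city'],
            ['council', 'budget']]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Each inner list corresponds to one document, and a word that appears twice (here `city`) gets a count of 2 rather than two entries.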

## Coherence Score

This next part only needs to be run once, to see how the coherence score varies with the number of topics. It gives us a better idea of what an ideal number of topics might be, though we still need to verify the result manually and experiment with the number of topics. Based on both the coherence score and qualitative observation of the granularity and coherence of individual topics, we can decide how many topics to proceed with.

In [None]:
# calculate coherence score
from gensim.models import CoherenceModel
from gensim.models import LdaModel

# Running every possible number of topics would take far too long, so we
# pre-selected candidates, with more on the lower side and wider gaps as we
# go higher: [5, 10, 15, 20, 30, 40, 50].
# List only the value(s) to run in this session.
number_of_topics = [40]
coherence_score = []
for n in number_of_topics:
    lda_model_c = LdaModel(corpus=corpus,
                           id2word=id2word,
                           alpha='auto',
                           eta='auto',
                           passes=10,
                           iterations=50,
                           random_state=42,
                           num_topics=n)
    coherence_model_lda = CoherenceModel(model=lda_model_c,
                                         texts=texts,
                                         dictionary=id2word,
                                         coherence='c_v',
                                         topn=30)
    coherence_score.append(coherence_model_lda.get_coherence())

# Create a dataframe of coherence score by number of topics
topic_coherence = pd.DataFrame({'number_of_topics': number_of_topics,
                                'coherence_score': coherence_score})

# display the table of coherence values
topic_coherence

Coherence scores from separate runs (run time in parentheses):

- 5 topics: 0.496398 (9 mins)
- 20 topics: 0.604874 (11 mins)
- 30 topics: 0.567641 (15 mins)
- 40 topics: 0.543457 (31 mins)
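
The recorded scores can be ranked with a few lines of Python. This sketch only uses the four values listed above; `recorded_scores` is a hypothetical name, and the top-ranked value is a starting point for manual inspection rather than an automatic choice.

```python
# Coherence scores copied from the list above: {number_of_topics: score}
recorded_scores = {5: 0.496398, 20: 0.604874, 30: 0.567641, 40: 0.543457}

# Rank the candidate topic counts from highest to lowest coherence
ranked = sorted(recorded_scores.items(), key=lambda item: item[1], reverse=True)

for n_topics_candidate, score in ranked:
    print(f"{n_topics_candidate:>2} topics: {score:.6f}")
```

Here 20 topics ranks first and 30 second, so both are reasonable candidates to inspect by hand.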

In [None]:
# plot these coherence scores
topic_plot = topic_coherence.plot.line(x='number_of_topics', y='coherence_score')

topic_plot

## LDA model

Once we have settled on a good number of topics above, we can run the LDA model below.

In [None]:
from gensim.models import LdaModel
# Define the number of topics 
n_topics = 30

# Run the LDA model
lda_model = LdaModel(corpus=corpus,
                        id2word=id2word,
                        alpha='auto', 
                        eta='auto', 
                        passes=10, 
                        iterations=500, 
                        random_state=42,
                        num_topics=n_topics)

print("lda_model finished.")

In [None]:
# list of top 30 words with their probability weights
print("lda_model, with top 30 words: ")
for idx, topic in lda_model.print_topics(num_topics=n_topics, num_words=30):
    print("Topic: {} Word: {}".format(idx, topic))

In [None]:
# Import and enable notebook to run visualization
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
 
vis = pyLDAvis.gensim_models.prepare(lda_model, 
                                     corpus, 
                                     dictionary=lda_model.id2word,
                                     mds='mmds',
                                     sort_topics=False)

pyLDAvis.save_html(vis, 'lda_model_nyt_30_topics.html')
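
Since the output file name hard-codes both the subset and the number of topics, it may help to build it from variables so that runs on different filters do not overwrite each other. A minimal sketch; `output_name` and the `subset` labels are hypothetical and not part of the original notebook.

```python
def output_name(subset, n_topics):
    """Build an HTML file name such as 'lda_model_nyt_30_topics.html'."""
    return f"lda_model_{subset}_{n_topics}_topics.html"


print(output_name('nyt', 30))
print(output_name('1990s', 20))
```

The result can then be passed as the second argument to `pyLDAvis.save_html`, for example `pyLDAvis.save_html(vis, output_name('nyt', n_topics))`.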