## Introduction to Computational Social Science methods with Python

### Natural Language Processing - Topic Modeling

<div class='alert alert-block alert-success'>
<b>In this Python notebook</b>, we will explore how to perform topic modeling using Latent Dirichlet Allocation (LDA) on the preprocessed dataset of news articles. Topic modeling is a core task in natural language processing (NLP) that involves discovering latent topics in a corpus of text data. LDA is a popular topic modeling technique that represents documents as mixtures of topics, where each topic is a probability distribution over the words in the vocabulary.

By the end of this notebook, you will have a basic understanding of how to perform topic modeling using LDA on text data, and how to analyze and visualize the discovered topics for insights and understanding. Let's get started!
</div>

## A. Latent Dirichlet Allocation (LDA)

**Latent Dirichlet Allocation (LDA)** is a generative probabilistic model that is commonly used for topic modeling.

**Topic modeling** is a technique in natural language processing (NLP) used to uncover underlying themes in a large corpus of text. The goal of topic modeling is to identify the most important topics in a collection of documents and to extract the most important words and phrases associated with each topic. There are several popular topic modeling algorithms, including Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). These algorithms are **unsupervised**, meaning that they do not require labeled data. Instead, they rely on the assumption that words that frequently occur together in the same documents are likely to be associated with the same topic.

Topic modeling typically involves the following steps:

- Preprocessing: the text data is cleaned to remove unnecessary information such as stop words, punctuation, and special characters.
- Vectorization: the preprocessed text data is converted into a numerical vector representation, such as a bag-of-words or TF-IDF (Term Frequency-Inverse Document Frequency) matrix.
- Model training: a topic modeling algorithm is applied to the vectorized text data to identify the main topics. This includes identifying the words or phrases most strongly associated with each topic.
- Topic interpretation: the resulting topics are interpreted by examining the most representative words or phrases for each topic. This may involve manually inspecting the most important words for each topic or using automated techniques to summarize the topics.
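
The first two steps can be sketched on a toy example using only the Python standard library. This is a minimal illustration, not the pipeline used in the previous notebook: the three documents, the tiny stop-word list, and the helper names `preprocess` and `tf_idf` are all made up for this sketch, and the TF-IDF weighting shown is just one of several common variants.

```python
import math
import re
from collections import Counter

# Toy corpus; these three documents are invented for illustration.
docs = [
    "The central bank raised interest rates.",
    "The team won the match in extra time.",
    "Interest rates affect the housing market.",
]
stop_words = {"the", "in"}  # tiny hypothetical stop-word list


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]


tokens = [preprocess(d) for d in docs]

# Bag-of-words: one term -> count mapping per document.
bow = [Counter(doc) for doc in tokens]

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokens for term in set(doc))
n_docs = len(docs)


def tf_idf(doc_counts):
    """Weight raw counts by log(N / df); one of several common TF-IDF variants."""
    return {t: c * math.log(n_docs / df[t]) for t, c in doc_counts.items()}


weights = [tf_idf(b) for b in bow]

print(bow[0])
print(weights[0])
```

Terms that appear in many documents (here, "interest" and "rates") receive a lower TF-IDF weight than terms that appear in only one document, while the bag-of-words counts treat them equally.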

LDA assumes that each document is a mixture of different topics, and each topic is a probability distribution over words. The algorithm takes a collection of documents as input and returns a set of topics, each represented by a probability distribution over words.

The basic idea behind LDA is that each document is represented as a distribution of topics and each topic is represented as a distribution of words. The algorithm assumes that the documents are generated in the following way:

- For each document, a distribution of topics is selected from a Dirichlet distribution.
- For each word in the document:
    - Select a topic from the distribution of topics of the document.
    - Select a word from the distribution of words for the selected topic.
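
The generative story above can be simulated in a few lines. The sketch below is only an illustration under simplifying assumptions: the vocabulary, the two topic-word distributions, and the `alpha` values are hypothetical, and the topics are fixed by hand rather than being drawn from their own Dirichlet prior as in the full model. The Dirichlet draw is obtained by normalizing independent Gamma samples.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

vocab = ["bank", "rates", "market", "team", "match", "goal"]

# Two hypothetical topics, each a probability distribution over the vocabulary.
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "finance"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "sports"-like topic
]
alpha = [0.5, 0.5]  # Dirichlet concentration parameters (assumed values)


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, rng):
    """Generate one document following the LDA generative process."""
    # Step 1: select the document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 2a: select a topic from the document's topic distribution.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Step 2b: select a word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words


theta, words = generate_document(8, rng)
print(theta)
print(words)
```

Fitting LDA amounts to running this process in reverse: given only the observed words, the algorithm infers plausible values for the topic mixtures and the topic-word distributions.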

The algorithm iteratively learns the topic distributions for each document and the word distributions for each topic by approximately maximizing the probability of the observed data. Because exact inference is intractable, this optimization is usually done with approximate methods such as variational expectation maximization (EM) or Gibbs sampling.

Once the algorithm has converged, it returns a set of topics, each represented by a distribution over words. These topics can be interpreted by examining the most likely words for each topic.

Here we present an example of LDA applied to a set of documents. We will use the news articles that we preprocessed in the previous notebook. Let's import the dictionary, the preprocessed corpus, and the bag-of-words and TF-IDF matrices that we previously extracted from this corpus:

In [1]:
import pandas as pd
import numpy as np 
import pickle as pkl

# import raw dataset 
news = pd.read_csv("../data/news_subset.csv")

# import also the dictionary, the preprocessed corpus, and the BoW and TF-IDF matrices
with open("./output/tf_idf_gensim.pkl", "rb") as file:
    tf_idf = pkl.load(file)

with open("./output/dict_gensim.pkl", "rb") as file:
    dct = pkl.load(file)

with open("./output/corpus.pkl", "rb") as file:
    corpus = pkl.load(file)

with open("./output/document_term_matrix.pkl", "rb") as file:
    corpus_bow = pkl.load(file)

Then, we run LDA on the document-term matrix. We use the Gensim implementation of LDA:

In [None]:
import gensim
from gensim.models import LdaModel

# train an LDA model on the bag-of-words corpus (LDA expects word counts, not TF-IDF weights)
num_topics = 10
lda_model = LdaModel(corpus_bow,
                     id2word=dct, 
                     num_topics=num_topics, 
                     random_state=100, 
                     update_every=1, 
                     chunksize=100, 
                     passes=10, 
                     alpha='auto', 
                     per_word_topics=True)

We can inspect the extracted topics by looking at the 10 most probable words of each topic:

In [None]:
# print the topics and associated keywords
for topic in lda_model.print_topics():
    print(topic)

To assess the quality of the topics extracted by the model, we can use different metrics such as the coherence score and perplexity.

**Coherence score** is a measure of how coherent the topics generated by a topic model are, based on the co-occurrence of words in the corpus. It is often used as an evaluation metric for topic models, in addition to perplexity.
The coherence score is typically based on the top N words in each topic, and measures the similarity between pairs of words in the same topic. There are different ways to define the coherence score, but one common approach is to use the Pointwise Mutual Information (PMI) between pairs of words.

PMI measures the degree of association between two words, based on the probability of their co-occurrence in the corpus relative to their individual probabilities. A high PMI score indicates that the two words are strongly associated and likely to appear together in the same context.
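
As a rough illustration, PMI can be estimated from document-level co-occurrence counts. The sketch below is a simplified version under stated assumptions, not the measure Gensim computes: the toy corpus is invented, probabilities are estimated as the fraction of documents containing the words, and the helper names `p`, `pmi`, and `topic_coherence` are hypothetical. A small constant `eps` avoids taking the logarithm of zero for pairs that never co-occur.

```python
import math
from itertools import combinations

# Toy tokenized corpus, invented for illustration.
docs = [
    ["bank", "rates", "market"],
    ["bank", "rates"],
    ["team", "match"],
    ["team", "match", "market"],
]
n_docs = len(docs)
doc_sets = [set(d) for d in docs]


def p(*words):
    """Fraction of documents that contain all of the given words."""
    return sum(all(w in d for w in words) for d in doc_sets) / n_docs


def pmi(w1, w2, eps=1e-12):
    """PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) ); eps avoids log(0)."""
    return math.log((p(w1, w2) + eps) / (p(w1) * p(w2)))


def topic_coherence(top_words):
    """Average PMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)


print(pmi("bank", "rates"))   # always co-occur: positive PMI
print(pmi("bank", "market"))  # co-occur as often as chance: PMI near 0
print(pmi("bank", "team"))    # never co-occur: strongly negative PMI
print(topic_coherence(["bank", "rates", "market"]))
```

A topic whose top words tend to appear in the same documents receives a higher average PMI than a topic that mixes unrelated words.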

The coherence score is calculated as the average PMI score over all pairs of words in each topic, and then averaged over all topics in the model. A higher coherence score indicates that the topics are more coherent and contain more related and meaningful words.
Coherence score is often used in combination with perplexity to evaluate the quality of a topic model. Perplexity measures how well the model fits the data, while coherence measures how well the topics generated by the model make sense in terms of the co-occurrence of words in the corpus.
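
For intuition, perplexity can be written as the exponential of the negative average log-probability that a model assigns to each observed token, so lower values indicate a better fit. The snippet below is a minimal sketch with made-up per-token probabilities; the function name `perplexity` and both lists of probabilities are hypothetical and are not produced by the model trained above.

```python
import math


def perplexity(token_probs):
    """Exponential of the negative mean log-probability of the observed tokens."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


# Hypothetical per-token probabilities from two models on the same held-out text.
confident_model = [0.20, 0.25, 0.10, 0.30]
uniform_model = [0.01, 0.01, 0.01, 0.01]  # uniform guess over a 100-word vocabulary

print(perplexity(confident_model))
print(perplexity(uniform_model))
```

A model that guesses uniformly over a 100-word vocabulary has a perplexity of 100, which is why perplexity is often read as the effective number of equally likely word choices. In Gensim, `lda_model.log_perplexity(corpus_bow)` returns a per-word likelihood bound from which a perplexity estimate can be derived.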

We can also use the coherence score to choose the number of topics, by training several models with different numbers of topics and comparing their coherence scores. The model with the highest coherence score is then selected.



In [None]:
from gensim.models import CoherenceModel
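# coherence='c_v' is Gensim's default: a sliding-window measure built on normalized PMI
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus, dictionary=dct, coherence='c_v')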
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score with {0} topics: {1}'.format(num_topics, coherence_lda))


In [None]:
#scores = []
#for num_topics in np.arange(5, 20):

    # fit LDA model
    #lda_model = LdaModel(corpus_bow, id2word=dct, num_topics=num_topics)

    # compute Coherence Score
    #coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus, dictionary=dct)
    #coherence_lda = coherence_model_lda.get_coherence()
    #print('\nCoherence Score with {0} topics: {1}'.format(num_topics, coherence_lda))

    #scores.append(coherence_lda)

#import matplotlib.pyplot as plt
#plt.plot(np.arange(5, 20), scores, marker="o")
#plt.xlabel("Number of topics")
#plt.ylabel("Coherence Score")

#print("Max coherence with: {0} topics".format(np.arange(5, 20)[np.argmax(scores)]))

Finally, using the package pyLDAvis, we can explore the extracted topics:

In [3]:
import pyLDAvis
import pyLDAvis.gensim  # note: renamed to pyLDAvis.gensim_models in recent pyLDAvis releases
# !pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install --user

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus_bow, dct)
vis

