# Introduction to Topic Modeling and Implementing Topic Modeling Techniques

Topic Modeling is a type of statistical modeling used to uncover hidden structure in a collection of texts. In simpler terms, it is a way to find the main topics that emerge from a large set of documents. Topic modeling is part of a larger group of algorithms known as 'unsupervised learning'. 'Unsupervised' because we don't provide the algorithm with predefined labels, it finds structure on its own, and 'learning' because it gets better and better as it processes more data.

In communication research, Topic Modeling is often used to identify themes or discourses in a large collection of documents, such as newspaper articles or social media posts. For example, a researcher might be interested in understanding the primary narratives around climate change on Twitter, or the main themes in news coverage of an election. 

## LDA - Latent Dirichlet Allocation

There are several algorithms for Topic Modeling such as Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA). In this lesson, we'll focus on Latent Dirichlet Allocation (LDA), which is one of the most popular techniques for Topic Modeling.

LDA assumes that every document is a mixture of topics and each topic is a mixture of words. This assumption helps us find topics which are nothing but a bunch of words ordered with a probability that defines how important a word is, for that particular topic.

# Examples of Communication Research Questions That Can Be Answered by Topic Modeling

1. **News Media:** What are the primary topics covered by a news outlet during a certain period? Does the focus of topics change over time?

2. **Social Media:** What are the main themes in public discourse about a particular issue on social media platforms like Twitter, Facebook, or Reddit?

3. **Political Speeches:** What topics do politicians focus on in their speeches? Does the focus change based on the political context?

4. **Public Opinions:** What are the primary concerns of the public about a particular issue, as reflected in letters to the editor, public comments on news websites, or online discussion forums?

# Types of Data That Can Be Analyzed by Topic Modeling

Virtually any type of text data can be analyzed with topic modeling. This includes, but is not limited to:

1. **News Articles:** Topic modeling can help discover the main themes in a large collection of news articles.

2. **Social Media Posts:** Topic modeling can be used to understand public discourse on social media platforms.

3. **Political Speeches:** Analyze the content of speeches to understand the primary themes.

4. **Research Papers:** Discover the main research themes in a set of academic papers.

The applications of topic modeling are vast and varied. In your final project for this course, you might consider how topic modeling could help answer your research questions.


## Load and Preprocess the Data

Suppose we have a dataset of tweets related to mental health. We'll load this data and then preprocess it by removing stop words and converting all text to lowercase:

In [1]:
!pip install nltk

Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Using cached regex-2023.5.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (769 kB)
Collecting click
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Installing collected packages: regex, click, nltk
Successfully installed click-8.1.3 nltk-3.8.1 regex-2023.5.5


In [2]:
# Import the necessary packages
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
import re
import nltk
from nltk.corpus import stopwords
# Make sure you have the stop words package downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Load data
df = pd.read_csv('tweets.csv')

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove RT (retweet sign)
    text = re.sub(r'rt[\s]+', '', text)
    # Remove mentions
    text = re.sub(r'@\S+', '', text)
    # Remove all non-alphabetic characters
    text = re.sub(r'\W', ' ', text)
    # Remove single characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Remove leading and trailing whitespaces
    text = text.strip()
    # Remove stopwords and do stemming
    # text = " ".join([stemmer.stem(i) for i in text.split() if i not in stop_words])
    return text

df['text_cleaned'] = df['post_text'].apply(preprocess_text)

In [4]:
df[['text_cleaned', 'post_text']]

Unnamed: 0,text_cleaned,post_text
0,it just over 2 years since was diagnosed with ...,It's just over 2 years since I was diagnosed w...
1,it sunday need break so m planning to spend a...,"It's Sunday, I need a break, so I'm planning t..."
2,awake but tired need to sleep but my brain has...,Awake but tired. I need to sleep but my brain ...
3,retro bears make perfect gifts and are great f...,RT @SewHQ: #Retro bears make perfect gifts and...
4,it hard to say whether packing lists are makin...,It’s hard to say whether packing lists are mak...
...,...,...
19995,a day without sunshine is like night,A day without sunshine is like night.
19996,boren laws 1 when in charge ponder 2 wh...,"Boren's Laws: (1) When in charge, ponder. (2) ..."
19997,the flow chais most thoroughly oversold piece ...,The flow chart is a most thoroughly oversold p...
19998,ships are safe in harbor but they were never ...,"Ships are safe in harbor, but they were never ..."


In [5]:
# We will only use the 'text_cleaned' column for our analysis
documents = df['text_cleaned']

# Initialize CountVectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(documents)

## Construct the Topic Model

Now we're ready to construct a topic model:

In [21]:
# Tweak the two parameters below (use int values below 15)
number_topics = 3
number_words = 10

# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)

## View the Topics

Lastly, let's view the topics:

In [15]:
words = list(count_vectorizer.get_feature_names_out())
len(words)

19960

In [18]:
lda.components_.shape

(5, 19960)

In [22]:
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % (topic_idx+1))
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

print_topics(lda, count_vectorizer, number_words)


Topic #1:
depression thank new love twitter yong treatments say thanks following

Topic #2:
just know want don think really best im life people

Topic #3:
like user just trump amp don good need feel time


This should display a list of topics, each represented as a list of words. Understanding these topics can be very useful for communication research. For instance, it can help identify how public discourse around a given topic changes over time, or detect different narratives in public discourse.

Remember, the choice of the number of topics (`number_topics`) and the number of top words (`number_words`) can significantly influence the results. So, feel free to experiment with these values to see how the topics change.

## Visualization

### 1. Topic distribution across documents:

Preparing the LDA model's output (topic distribution for each document) to be visualized using t-SNE, which is a technique for reducing the dimensionality of data (specifically, it's used for visualizing high-dimensional data in 2 or 3 dimensions).

In [23]:
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
from sklearn.manifold import TSNE

# Get topic weights
topic_weights = []
for i, row_list in enumerate(lda.transform(count_data)):
    topic_weights.append([w for i, w in enumerate(row_list)])

# Array of topic weights    
arr = pd.DataFrame(topic_weights).fillna(0).values

# Keep the well separated points (optional)
arr = arr[np.amax(arr, axis=1) > 0.35]

# Dominant topic number in each doc
topic_num = np.argmax(arr, axis=1)

# tSNE Dimension Reduction
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
tsne_lda = tsne_model.fit_transform(arr)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 19853 samples in 0.008s...
[t-SNE] Computed neighbors for 19853 samples in 0.523s...
[t-SNE] Computed conditional probabilities for sample 1000 / 19853
[t-SNE] Computed conditional probabilities for sample 2000 / 19853
[t-SNE] Computed conditional probabilities for sample 3000 / 19853
[t-SNE] Computed conditional probabilities for sample 4000 / 19853
[t-SNE] Computed conditional probabilities for sample 5000 / 19853
[t-SNE] Computed conditional probabilities for sample 6000 / 19853
[t-SNE] Computed conditional probabilities for sample 7000 / 19853
[t-SNE] Computed conditional probabilities for sample 8000 / 19853
[t-SNE] Computed conditional probabilities for sample 9000 / 19853
[t-SNE] Computed conditional probabilities for sample 10000 / 19853
[t-SNE] Computed conditional probabilities for sample 11000 / 19853
[t-SNE] Computed conditional probabilities for sample 12000 / 19853
[t-SNE] Computed conditional probabilities for sam

In [24]:
!pip install bokeh

Collecting bokeh
  Using cached bokeh-3.1.1-py3-none-any.whl (8.3 MB)
Collecting xyzservices>=2021.09.1
  Using cached xyzservices-2023.5.0-py3-none-any.whl (56 kB)
Collecting tzdata>=2022.1
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Installing collected packages: xyzservices, tzdata, bokeh
Successfully installed bokeh-3.1.1 tzdata-2023.3 xyzservices-2023.5.0


In [25]:
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
# Plot the Topic Clusters using Bokeh
output_notebook()
mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
plot = figure(title="t-SNE Clustering of {} LDA Topics".format(number_topics), 
              width=900, height=700)
plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])
show(plot)

### 2. Intra-topic distance visualization: [Link](https://nbviewer.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb#topic=51&lambda=1&term=)